<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1015"> <Title>Handling noisy training and testing data</Title>
<Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction: Errors </SectionTitle>
<Paragraph position="0"> Nobody's perfect. A cliché, but in the field of empirical natural language processing, we know it to be true: on a daily basis, we work with large corpora created by, and often marked up by, humans. Fallible as ever, these humans have made errors. For the errors in content, be they spelling, syntax, or something else, we can hope to build more robust systems that will be able to handle them. But what of the errors in markup? In this paper, we propose a system for cataloguing corpus errors, and discuss some strategies for dealing with them as a research community. Finally, we will present an example (function tagging) that demonstrates the appropriateness of our methods.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 An error taxonomy 2.1 Type A: Detectable errors </SectionTitle>
<Paragraph position="0"> The easiest errors, which we have dubbed "Type A", are those that can be automatically detected and fixed. These typically come up when there would be multiple reasonable ways of tagging a certain interesting situation: the markup guidelines arbitrarily choose one, and the human annotator unthinkingly uses the other.</Paragraph>
<Paragraph position="1"> The canonical example of this sort of thing is the treebank's LGS tag, representing the "logical subject" of a passive construction. It makes a great deal of sense to put this tag on the NP object of the 'by' construction; it makes almost as much sense to tag the PP itself, especially since (given a choice) most other function tags are put there. The treebank guidelines specifically choose the former: "It attaches to the NP object of by and not to the PP node itself." (Bies et al., 1995) Nevertheless, in several cases the annotators put the tag on the PP, as shown in Figure 1. We can automatically correct this error by algorithmically removing the LGS tag from any such PP and adding it to the object thereof.</Paragraph>
<Paragraph position="2"> The unifying feature of all Type A errors is that the annotator's intent is still clear. In the LGS case, the annotator managed to clearly indicate the presence of a passive construction and its logical subject. Since the transformation from what was marked to what ought to have been marked is straightforward and algorithmic, we can easily apply this correction to all data.</Paragraph>
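<Paragraph> The LGS correction just described is mechanical enough to sketch in a few lines of code. The sketch below is only an illustration, not the tsed rule we actually used (referred to again in section 4.3); it assumes Penn-Treebank-style bracketed trees loaded with NLTK, and the function name and the crude check for 'by' are choices made for the example.

from nltk.tree import Tree

def fix_lgs(tree):
    """Move a misplaced -LGS tag from a by-PP down to its NP object."""
    for node in tree.subtrees():
        label = node.label()
        if not label.startswith("PP") or "-LGS" not in label:
            continue
        leaves = node.leaves()
        if not leaves or leaves[0].lower() != "by":
            continue  # only rewrite genuine passive by-phrases
        node.set_label(label.replace("-LGS", ""))
        for child in node:
            if isinstance(child, Tree) and child.label().startswith("NP"):
                child.set_label(child.label() + "-LGS")
                break
    return tree

example = Tree.fromstring(
    "(VP (VBN eaten) (PP-LGS (IN by) (NP (DT the) (NN cat))))")
print(fix_lgs(example))
# (VP (VBN eaten) (PP (IN by) (NP-LGS (DT the) (NN cat))))
</Paragraph>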
<Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 2.2 Type B: Fixable errors </SectionTitle>
<Paragraph position="0"> Next, we come to the Type B errors, those which are fixable but require human intervention at some point in the process. In theory, this category could include errors that could be found automatically but require a human to fix; this doesn't happen in practice, because if an error is sufficiently systematic that an algorithm can detect it and be certain that it is in fact an error, it can usually be corrected with certainty as well. In practice, the instances of this class of error are all cases where the computer can't detect the error for certain. However, for all Type B errors, once detected, the correction that needs to be made is clear, at least to a human observer with access to the annotation guidelines.</Paragraph>
<Paragraph position="1"> Certain Type B errors are moderately easy to find. When annotators misunderstand a complicated markup guideline, they mismark in a somewhat predictable way. While these errors are not fully systematically detectable, an algorithm can leverage such patterns to extract a list of tags or parses that might be incorrect, which a human can then examine. Some errors of this type (henceforth "Type B′") include:</Paragraph>
<Paragraph position="2"> VBD / VBN. Often the past tense form of a verb (VBD) and its past participle (VBN) have the same form, and thus annotators sometimes mistake one for the other, as in Figure 2. Some such cases are not detectable, which is why this is not a Type A error. (There is a subclass of this error which is Type A: when we find a VBD whose grandparent is a VP headed by a form of 'have', we can deterministically retag it as VBN.)</Paragraph>
<Paragraph position="3"> IN / RB / PRT. There are guidelines for telling these three things apart, but frequently a preposition (IN) is marked when an adverb (RB) or particle (PRT) would be more appropriate. If an IN is occurring somewhere other than under a PP, it is likely to be a mistag.</Paragraph>
<Paragraph position="4"> Occasionally, an extracted list of maybe-errors will be "perfect", containing only instances that are actually corpus errors. This happens when the pattern is a very good heuristic, though not necessarily valid (which is why the errors are Type B′, and not Type A). When filing corrections for these, it is still best to annotate them individually, as the corrections may later be applied to an expanded or modified data set, for which the heuristic would no longer be perfect.</Paragraph>
<Paragraph position="6"> Other fixable errors are pretty much isolated.</Paragraph>
<Paragraph position="7"> Within section 24 of the treebank, for instance, we have: the word 'long' tagged as an adjective (JJ) when clearly used as a verb (VB); the word 'that' parsed into a noun phrase instead of heading a subordinate clause, as in Figure 3; and a phrase headed by 'about', as in 'think about', tagged as a location (LOC). These isolated errors (resulting, presumably, from a typo or a moment of inattention on the part of the annotator) are not in any way predictable, and can be found essentially only by examining the output of one's algorithm, analysing the "errors", and noticing that the treebank was incorrect, rather than (or in addition to) the algorithm. We will call these Type B″ errors.</Paragraph>
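<Paragraph> The pattern-extraction idea behind Type B′ errors, and the deterministic VBD/VBN subcase noted above, can be sketched as follows. This is again only an illustration using NLTK trees, not our actual tooling; the simple head-finding shortcut and the list of 'have' forms are simplifications introduced for the example.

from nltk.tree import ParentedTree

HAVE_FORMS = {"have", "has", "had", "having", "'ve", "'d"}

def retag_vbd_under_have(tree):
    # Deterministic subcase (Type A): a VBD whose grandparent is a VP
    # headed by a form of 'have' must really be a VBN.
    for sub in tree.subtrees(lambda t: t.label() == "VBD"):
        parent = sub.parent()
        grand = parent.parent() if parent is not None else None
        if grand is None or not grand.label().startswith("VP"):
            continue
        # crude head finding: first verbal preterminal child of that VP
        head = next((c.leaves()[0].lower() for c in grand
                     if isinstance(c, ParentedTree)
                     and c.label().startswith("VB")), None)
        if head in HAVE_FORMS:
            sub.set_label("VBN")
    return tree

def stray_in_candidates(tree):
    # Heuristic (Type B'): an IN occurring somewhere other than under
    # a PP is a likely mistag; return the candidates for human review.
    return [sub for sub in tree.subtrees(lambda t: t.label() == "IN")
            if sub.parent() is not None
            and not sub.parent().label().startswith("PP")]

sent = ParentedTree.fromstring(
    "(S (NP (PRP They)) (VP (VBP have) (VP (VBD eaten) (ADVP (IN up)))))")
retag_vbd_under_have(sent)        # 'eaten' is retagged VBN
print(stray_in_candidates(sent))  # flags the (IN up) under ADVP
</Paragraph>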
</Section>
<Section position="2" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 2.3 Type C: Systematic inconsistency </SectionTitle>
<Paragraph position="0"> Sometimes, there is a construction that the markup guidelines writers didn't think about, didn't write up, or weren't clear about. In these cases, annotators are left to rely on their own separate intuitions. This leaves us with markup that is inconsistent and therefore clearly partially in error, but with no obvious correction. There is really very little to be done about these, aside from noting them and perhaps controlling for them in the evaluation.</Paragraph>
<Paragraph position="1"> Some Type C errors in the treebank include:</Paragraph>
<Paragraph position="2"> 'ago'. English's sole postposition seems to have given annotators some difficulty. Lacking a postposition tag, many tagged such occurrences of 'ago' as a preposition (IN); others used the adverb tag (RB) exclusively. In particular, the annotators of sections 05, 09, 12, 17, 20, and 24 used IN sometimes, while the others tagged all occurrences of 'ago' as adverbs. Since some occurrences really are adverbs, this just makes a big mess.</Paragraph>
<Paragraph position="3"> ADVP-MNR. The MNR tag is meant to be applied to constituents denoting manner or instrument. Some annotators (but not all) seemed to decide that any adverbial phrase (ADVP) headed by an '-ly' word must get a MNR tag, applying it to words like 'suddenly', 'significantly', and 'clearly'.</Paragraph>
<Paragraph position="6"> The hallmark of a Type C error is that even what ought to be correct isn't always clear, and as a result, any plan to correct a group of Type C errors will have to first include discussion on what the correct markup guideline should be.</Paragraph> </Section> </Section>
<Section position="4" start_page="2" end_page="4" type="metho"> <SectionTitle> 3 tsed </SectionTitle>
<Paragraph position="0"> In order to effect these changes in some communicable way, we have implemented a program called tsed, by analogy with and inspired by the already prevalent tgrep search program. (tgrep was written by Richard Pito of the University of Pennsylvania, and comes with the treebank.)</Paragraph>
<Paragraph position="1"> It takes a search pattern and a replacement pattern, and after finding the constituent(s) that match the search pattern, modifies them and prints the result. For those already familiar with tgrep search syntax, this should be moderately intuitive.</Paragraph>
<Paragraph position="2"> To the basic pattern-matching syntax of tgrep, we have added a few extra restriction patterns (for specifying sentence number and head word), as well as a way of marking nodes for later reference in the replacement pattern (by simply wrapping a constituent in square brackets instead of parentheses).</Paragraph>
<Paragraph position="3"> The replacement syntax is somewhat more complicated, because wherever possible we want to be able to construct the new trees by reference to the old tree, in order to preserve modifiers and structure we may not know about when we write the pattern.</Paragraph>
<Paragraph position="4"> For full details of the program's abilities, consult the program documentation, but here are the main ones: Relabelling. Constituents can be relabelled with no change to any of their modifiers or children.</Paragraph>
<Paragraph position="5"> Tagging. A tag can be added to or removed from a constituent, without changing any modifiers or children.</Paragraph>
<Paragraph position="6"> Reference. Constituents in the search pattern can be included by reference in the replacement pattern.</Paragraph>
<Paragraph position="8"> Construction. New structure can be built by specifying it in the usual S-expression format, e.g. (NP (NN snork)). Usually used in combination with Reference patterns.</Paragraph>
<Paragraph position="9"> Along with tsed itself, we distribute a Perl program wsjsed to process treebank change scripts like the following:</Paragraph>
<Paragraph position="11"> This script would make a batch modification to the zeroth sentence of the 29th file in section 24.
The batch includes two corrections: the first matches a noun phrase (NP) whose sister is an ADJP and whose parent is a VP headed by the word 'keep'. The matched NP node is replaced by a (created) S node whose children will be that very NP and its sister ADJP. The second correction then finds an NP that ends in the word 'markets' and marks it with the SBJ function tag.</Paragraph>
<Paragraph position="12"> Distributing changes in this form is important for two reasons. First of all, by giving changes in their minimal, most general forms, they are small and easy to transmit, and easy to merge. Perhaps more importantly, since corpora are usually copyrighted and can only be used by paying a fee to the controlling body (usually LDC or ELDA), we need a way to distribute only the changes, in a form that is useless without having bought the original corpus. Scripts for tsed, or for wsjsed, serve this purpose.</Paragraph>
<Paragraph position="13"> These programs are available from our website.</Paragraph>
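<Paragraph> For readers without tsed, the effect of those two corrections can be approximated in ordinary code. The sketch below uses NLTK trees; the sentence is invented for the example (the actual wsjsed script and the tree for sentence 0 of file 29 in section 24 are not reproduced here), and the crude check that the VP's first word is 'keep' stands in for tsed's head-word restriction.

from nltk.tree import Tree

def wrap_np_adjp_under_keep(tree):
    # First correction: inside a VP headed by 'keep', replace an NP that
    # has an ADJP sister with a new S node dominating both.
    for node in tree.subtrees(lambda t: t.label() == "VP"):
        if not node.leaves() or node.leaves()[0].lower() != "keep":
            continue
        labels = [c.label() if isinstance(c, Tree) else None for c in node]
        if "NP" in labels and "ADJP" in labels:
            np, adjp = labels.index("NP"), labels.index("ADJP")
            new_s = Tree("S", [node[np], node[adjp]])
            for i in sorted((np, adjp), reverse=True):
                del node[i]          # remove the two original children
            node.insert(min(np, adjp), new_s)
    return tree

def mark_markets_sbj(tree):
    # Second correction: an NP ending in the word 'markets' gets the
    # SBJ function tag.
    for node in tree.subtrees(lambda t: t.label() == "NP"):
        if node.leaves() and node.leaves()[-1].lower() == "markets":
            node.set_label("NP-SBJ")
    return tree

t = Tree.fromstring(
    "(S (NP (NNS markets)) (VP (VB keep) (NP (NNS prices)) (ADJP (JJ low))))")
print(mark_markets_sbj(wrap_np_adjp_under_keep(t)))
# (S (NP-SBJ (NNS markets))
#    (VP (VB keep) (S (NP (NNS prices)) (ADJP (JJ low)))))
</Paragraph>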
</Section>
<Section position="5" start_page="4" end_page="8" type="metho"> <SectionTitle> 4 When to correct </SectionTitle>
<Paragraph position="0"> Now that we have analysed the different types of errors that can occur and how to correct them, we can discuss when and whether to do so.</Paragraph>
<Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.1 Training </SectionTitle>
<Paragraph position="0"> In virtually all empirical NLP work, the training set is going to encompass the vast majority of the data. As such, it is usually impractical for a human (or even a whole lab of humans) to sit down and revise the training. Type A errors can be corrected easily enough, as can some Type B errors whose heuristics have a high yield. Purely on grounds of practicality, though, it would be difficult to effect significant correction on a training set of any significant size (such as for the treebank).</Paragraph>
<Paragraph position="2"> Practicality aside, correcting the training set is a bad idea anyway. After expending an enormous effort to perfect one training set, the net result is just one correct training set. While it might make certain things easier and probably will improve the results of most algorithms, those improved results will not be valid for those same algorithms trained on other, non-perfect data; the vast majority of corpora will still be noisy. If a user of an algorithm, e.g. an application developer, chooses to perfect a training set to improve the results, that would be helpful, but it is important that researchers report results that are likely to be applicable more generally, to more than one training set. Furthermore, robustness to errors in the training, via smoothing or some other mechanism, will also make an algorithm robust to sparse data, the ever-present spectre that haunts nearly every problem in the field; thus eliminating all errors in the training ought not to have as much of an effect on a strong algorithm.</Paragraph> </Section>
<Section position="2" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 4.2 Testing </SectionTitle>
<Paragraph position="0"> Testing data is another story, however. In terms of practicality, it is more feasible, as the test set is usually at least an order of magnitude smaller than the training. More important, though, is the issue of fairness. We need to continue using noisy training data in order to better model real-world use, but it is unfair and unreasonable to have noise in the gold standard (sometimes more of a pyrite standard, really), which causes an algorithm to be penalised where it is more correct than the human annotation. As performance on various tasks improves, it becomes ever more important to be able to correct the testing data. A 'mere' 1% improvement on a result of 75% is not impressive, as it represents just a 4% reduction in apparent error, but the same 1% improvement on a result of 95% represents a 20% reduction in apparent error! In the end, a noisy gold standard sets an upper bound of less than 100% on performance, which is if nothing else counterintuitive.</Paragraph>
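<Paragraph> The arithmetic behind those figures is simply the fraction of the remaining apparent error that the improvement removes; the two-line helper below is purely illustrative, a quick check of the numbers just quoted.

# Relative reduction in apparent error for a given absolute gain,
# with scores expressed as percentages.
def error_reduction(before, after):
    return (after - before) / (100.0 - before)

print(error_reduction(75, 76))  # 0.04, i.e. a 4% reduction
print(error_reduction(95, 96))  # 0.2, i.e. a 20% reduction
</Paragraph>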
</Section>
<Section position="3" start_page="5" end_page="8" type="sub_section"> <SectionTitle> 4.3 Ethical considerations </SectionTitle>
<Paragraph position="0"> Of course, we cannot simply go about changing the corpus willy-nilly. We refer the reader to chapter 7 of David Magerman's thesis (1994) for a cogent discussion of why changing either the training or the testing data is a bad idea. However, we believe that there are now some changed circumstances that warrant a modification of this ethical dictum.</Paragraph>
<Paragraph position="1"> First, we are not allowed to look at testing data. How to correct it, then? An initial reaction might be to "promise" to forget everything seen while correcting the test corpus; this is not reasonable. Another solution exists, however, which is nearly as good and doesn't raise any ethical questions. Many research groups already use yet another section, separate from both the training and testing, as a sort of development corpus (in the treebank, this is usually section 24). When developing an algorithm, we must look at some output for debugging, preliminary evaluation, and parameter estimation; so this development section is used for testing until a piece of work is ready for publication, at which point the "true" test set is used. Since we are all reading this development output already anyway, there is no harm in reading it to perform corrections thereon. In publication, then, one can publish the results of an algorithm on both the unaltered and corrected versions of the development section, in addition to the results on the unaltered test section. We can then presume that a corrected version of the test corpus would result in a perceived error reduction comparable to that on the development corpus.</Paragraph>
<Paragraph position="2"> Another problem mentioned in that chapter is that of a researcher quietly correcting a test corpus, and publishing results on the modified data (without even noting that it was modified). The solution to this is simple: any results on modified data will need to acknowledge that the data is modified (to be honest), and those modifications need to be made public (to facilitate comparisons by later researchers). For Type A errors fixed by a simple rule, it may be reasonable to publish them directly in the paper that gives the results.</Paragraph>
<Paragraph position="3"> The rule we used to fix the LGS problem noted in section 2.1 is as follows:</Paragraph>
<Paragraph position="4"> For Type B errors, it would be more reasonable to simply publish them on a website, since there are bound to be a large number of them. The 235 corrections made to section 24 are available at http://www.cs.brown.edu/~dpb/tbfix/.</Paragraph>
<Paragraph position="5"> Finally, we would like to note that one of the reasons Magerman was ready to dismiss error in the testing was that the test data had "a consistency rate much higher than the accuracy rate of state-of-the-art parsers". This is no longer true.</Paragraph> </Section>
<Section position="4" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 4.4 Practical considerations </SectionTitle>
<Paragraph position="0"> As multiple researchers each begin to impose their own corrections, there are several new issues that will come up. First of all, even should everyone publish their own corrections, and post comparisons to previous researchers' corrected results, there is some danger that a variety of different correction sets will exist concurrently. To some extent this can be mitigated if each researcher posts both their own corrections by themselves and a full list of all the corrections they used (including their own). Even so, from time to time these varied correction sets will need to be collected and merged for the whole community to use.</Paragraph>
<Paragraph position="1"> More difficult to deal with is the fact that, inevitably, there will be disputes as to what is correct. Sometimes these will be between the treebank version and a proposed correction; there will probably also be cases where multiple competing corrections are suggested. There really is no good systematic policy for dealing with this. Disputes will have to be handled on a case-by-case basis, and researchers should probably note any disputes to their corrections that they know of when publishing results, but beyond that it will have to be up to each researcher's personal sense of ethics.</Paragraph>
<Paragraph position="2"> In all cases, a search-and-replace pattern should be made as general as possible (without being too general, of course), so that it interacts well with other modifications. Various researchers are already working with (deterministically) different versions of corpora (with new tags added, or empty nodes removed, or some tags collapsed, for instance, not to mention other corrections already performed), and it would be a bad idea to distribute corrections that are specific to one version of these. When in doubt, one should favour the original form of the corpus, naturally.</Paragraph>
<Paragraph position="3"> The final issue is not a practical problem, but an observation: once a researcher publishes a correction set, any further corrections by other researchers are likely to decrease the results of the first researcher's algorithm, at least somewhat. This is because that researcher is usually not going to notice corpus errors where the algorithm errs in the same way. This unfortunate consequence is inevitable, and hopefully will prove minor.</Paragraph> </Section> </Section> </Paper>