<?xml version="1.0" standalone="yes"?>
<Paper uid="A92-1021">
<Title>A Simple Rule-Based Part of Speech Tagger</Title>
<Section position="4" start_page="153" end_page="154" type="evalu">
<SectionTitle>3 Results</SectionTitle>
<Paragraph position="0">The tagger was tested on 5% of the Brown Corpus, including sections from every genre. First, the test corpus was tagged by the simple lexical tagger. Next, each of the patches was applied to the corpus in turn. Below is a graph showing the improvement in accuracy from applying patches. It is significant that with only 71 patches, an error rate of 5.1% was obtained.5 Of the 71 patches, 66 resulted in a reduction in the number of errors in the test corpus, 3 resulted in no net change, and 2 resulted in a higher number of errors. Almost all patches which were effective on the training corpus were also effective on the test corpus.</Paragraph>
<Paragraph position="1">Unfortunately, it is difficult to compare our results with other published results. [Meteer et al. 91] quotes an error rate of 3-4% on one domain (Wall Street Journal articles) and 5.6% on another (texts on terrorism in Latin American countries). However, both the domains and the tag set differ from those we use. [Church 88] reports an accuracy of "95-99% correct, depending on the definition of correct". We implemented a version of the algorithm described by Church.</Paragraph>
<Paragraph position="2">When trained and tested on the same samples used in our experiment, we found the error rate to be about 4.5%. [DeRose 88] quotes a 4% error rate; however, the sample used for testing was part of the training corpus.</Paragraph>
<Paragraph position="3">[Garside et al. 87] reports an accuracy of 96-97%. Their probabilistic tagger has been augmented with a hand-crafted procedure to pretag problematic "idioms". This procedure, which requires that a list of idioms be laboriously created by hand, contributes 3% toward the accuracy of their tagger, according to [DeRose 88]. The idiom list would have to be rewritten if one wished to use this tagger for a different tag set or a different corpus.</Paragraph>
<Paragraph position="4">It is interesting to note that the information contained in the idiom list can be acquired automatically by the rule-based tagger. For example, their tagger had difficulty tagging "as old as". An explicit rule was written to pretag "as old as" with the proper tags. According to the tagging scheme of the Brown Corpus, the first "as" should be tagged as a qualifier and the second as a subordinating conjunction. In the rule-based tagger, the most common tag for "as" is subordinating conjunction, so initially the second "as" is tagged correctly and the first "as" is tagged incorrectly. To remedy this, the system acquires the patch: if the current word is tagged as a subordinating conjunction, and so is the word two positions ahead, then change the tag of the current word to qualifier.6 The rule-based tagger has automatically learned how to properly tag this "idiom." Regardless of the precise rankings of the various taggers, we have demonstrated that a simple rule-based tagger with very few rules performs on par with stochastic taggers.</Paragraph>
<Paragraph position="5">6 This was one of the 71 patches acquired by the rule-based tagger.</Paragraph>
</Section>
</Paper>
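The acquired "as old as" patch lends itself to a short illustration. The following is a minimal sketch in Python, not the paper's implementation: it assumes a simple list-of-tags representation and Brown Corpus tag names (CS for subordinating conjunction, QL for qualifier); the function name is hypothetical.

```python
# Hypothetical sketch of the acquired patch described in the text:
# if the current word is tagged subordinating conjunction (CS) and the
# word two positions ahead is also tagged CS, retag the current word
# as qualifier (QL). Tag names follow the Brown Corpus tag set.

def apply_cs_to_ql_patch(tags):
    """Apply the patch in place to a list of Brown-style tags."""
    for i in range(len(tags) - 2):
        if tags[i] == "CS" and tags[i + 2] == "CS":
            tags[i] = "QL"  # the first 'as' becomes a qualifier
    return tags

# "as old as": the lexical tagger assigns 'as' its most common tag, CS,
# so the first 'as' starts out mistagged; the patch corrects it.
print(apply_cs_to_ql_patch(["CS", "JJ", "CS"]))  # -> ['QL', 'JJ', 'CS']
```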