<?xml version="1.0" standalone="yes"?>
<Paper uid="A92-1021">
  <Title>A Simple Rule-Based Part of Speech Tagger</Title>
  <Section position="3" start_page="0" end_page="153" type="metho">
    <SectionTitle>
2 The Tagger
</SectionTitle>
    <Paragraph position="0"> The tagger works by automatically recognizing and remedying its weaknesses, thereby incrementally improving its performance. The tagger initially tags by assigning each word its most likely tag, estimated by examining a large tagged corpus, without regard to context. In both sentences below, run would be tagged as a verb:  The run lasted thirty minutes.</Paragraph>
    <Paragraph position="1"> We run three miles every day.</Paragraph>
    <Paragraph position="2"> The initial tagger has two procedures built in to improve performance; neither makes use of contextual information. One procedure is given the information that capitalized words not seen in the training corpus tend to be proper nouns, and attempts to fix tagging mistakes accordingly. This information could be acquired automatically (see below), but is prespecified in the current implementation. The other procedure attempts to tag words not seen in the training corpus by assigning each such word the tag most common for words ending in the same three letters. For example, blahblahous would be tagged as an adjective, because this is the most common tag for words ending in ous. This information is derived automatically from the training corpus.</Paragraph>
    <Paragraph position="3"> This very simple algorithm has an error rate of about 7.9% when trained on 90% of the tagged Brown Corpus 1 [Francis and Kučera 82], and tested on a separate 5% of the corpus. 2 Training consists of compiling a list of the most common tag for each word in the training corpus.</Paragraph>
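    <Paragraph>The initial tagger described above (most likely tag per word, plus the two procedures for unseen words) might be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the order in which the two unseen-word procedures are tried, counting suffixes over tokens, and the fallback tag NN are all assumptions.

```python
from collections import Counter, defaultdict

def train_initial_tagger(tagged_corpus):
    """Build lookup tables from an iterable of (word, tag) pairs."""
    word_counts = defaultdict(Counter)
    suffix_counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        word_counts[word][tag] += 1
        # Assumption: suffix statistics are counted over tokens, lowercased.
        suffix_counts[word[-3:].lower()][tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in word_counts.items()}
    suffixes = {s: c.most_common(1)[0][0] for s, c in suffix_counts.items()}
    return lexicon, suffixes

def initial_tag(word, lexicon, suffixes, proper_noun="NP", default="NN"):
    """Tag one word without looking at its context."""
    if word in lexicon:
        return lexicon[word]          # most likely tag seen in training
    if word[:1].isupper():
        return proper_noun            # unseen and capitalized
    # Unseen and lowercase: most common tag for the same last three letters.
    # The default for an unseen suffix is an assumption of this sketch.
    return suffixes.get(word[-3:].lower(), default)
```

With a lexicon in which words ending in ous are mostly adjectives, initial_tag("blahblahous", lexicon, suffixes) returns the adjective tag, matching the example in the text.</Paragraph>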
    <Paragraph position="4"> The tagger then acquires patches to improve its performance. Patch templates are of the form:
 * If a word is tagged a and it is in context C, then change that tag to b, or
 * If a word is tagged a and it has lexical property P, then change that tag to b, or
 * If a word is tagged a and a word in region R has lexical property P, then change that tag to b.</Paragraph>
    <Paragraph position="5"> The initial tagger was trained on 90% of the corpus (the training corpus). Another 5% was held back for the patch acquisition procedure (the patch corpus) and 5% for testing. Once the initial tagger is trained, it is used to tag the patch corpus. A list of tagging errors is compiled by comparing the output of the tagger to the correct tagging of the patch corpus. This list consists of triples &lt;tag_a, tag_b, number&gt;, indicating the number of times the tagger mistagged a word with tag_a when it should have been tagged with tag_b in the patch corpus. Next, for each error triple, it is determined which instantiation of a template from the prespecified set of patch templates results in the greatest error reduction. Currently, the patch templates are: Change tag a to tag b when:
 1. The preceding (following) word is tagged z.
 2. The word two before (after) is tagged z.
 3. One of the two preceding (following) words is tagged z.</Paragraph>
    <Paragraph position="6"> 4. One of the three preceding (following) words is tagged z.</Paragraph>
    <Paragraph position="7"> 5. The preceding word is tagged z and the following word is tagged w.</Paragraph>
    <Paragraph position="8"> 6. The preceding (following) word is tagged z and the word two before (after) is tagged w.</Paragraph>
    <Paragraph position="9"> 7. The current word is (is not) capitalized.</Paragraph>
    <Paragraph position="10"> 8. The previous word is (is not) capitalized.</Paragraph>
    <Paragraph position="11">  For each error triple &lt;tag_a, tag_b, number&gt; and patch, we compute the reduction in error which results from applying the patch to remedy the mistagging of a word as tag_a when it should have been tagged tag_b. We then compute the number of new errors caused by applying the patch; that is, the number of times the patch results in a word being tagged as tag_b when it should be tagged tag_a. The net improvement is calculated by subtracting the latter value from the former.</Paragraph>
    <Paragraph position="12"> For example, when the initial tagger tags the patch corpus, it mistags 159 words as verbs when they should be nouns. If the patch "change the tag from verb to noun if one of the two preceding words is tagged as a determiner" is applied, it corrects 98 of the 159 errors. However, it results in an additional 18 errors from changing tags which really should have been verb to noun. This patch results in a net decrease of 80 errors on the patch corpus.</Paragraph>
    <Paragraph position="13"> The patch which results in the greatest improvement to the patch corpus is added to the list of patches. The patch is then applied in order to improve the tagging of the patch corpus, and the patch acquisition procedure continues.</Paragraph>
    <Paragraph position="14"> The first ten patches found by the system are listed below. 3</Paragraph>
    <Paragraph position="15">
 (1) TO IN NEXT-TAG AT
 (2) VBN VBD PREV-WORD-IS-CAP YES
 (3) VBD VBN PREV-1-OR-2-OR-3-TAG HVD
 (4) VB NN PREV-1-OR-2-TAG AT
 (5) NN VB PREV-TAG TO
 (6) TO IN NEXT-WORD-IS-CAP YES
 (7) NN VB PREV-TAG MD
 (8) PPS PPO NEXT-TAG .
 (9) VBN VBD PREV-TAG PPS
 (10) NP NN CURRENT-WORD-IS-CAP NO</Paragraph>
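    <Paragraph>The patch acquisition procedure described above (score each candidate patch by errors fixed minus errors introduced, keep the best one, apply it, repeat) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: only three tag-context templates are covered, the corpus is treated as one flat tag sequence, template conditions are evaluated on the tagging as it stood before the patch, and the stopping rule (no candidate with a positive net improvement) is hypothetical. All function names are invented for the sketch.

```python
# Hypothetical condition functions for three of the tag-context templates.
def prev_tag(tags, i, z):
    return i >= 1 and tags[i - 1] == z

def prev_1_or_2_tag(tags, i, z):
    return z in tags[max(0, i - 2):i]

def next_tag(tags, i, z):
    return tags[i + 1:i + 2] == [z]

TEMPLATES = {
    "PREV-TAG": prev_tag,
    "PREV-1-OR-2-TAG": prev_1_or_2_tag,
    "NEXT-TAG": next_tag,
}

def apply_patch(patch, tags):
    """Change tag a to b wherever the template condition holds."""
    a, b, name, z = patch
    condition = TEMPLATES[name]
    # Assumption: conditions are tested against the pre-patch tagging.
    return [b if t == a and condition(tags, i, z) else t
            for i, t in enumerate(tags)]

def net_improvement(patch, current, gold):
    """Errors corrected (gold is b) minus new errors (gold is a)."""
    a, b = patch[0], patch[1]
    patched = apply_patch(patch, current)
    changed_gold = [g for c, p, g in zip(current, patched, gold) if c != p]
    fixed = sum(1 for g in changed_gold if g == b)
    broken = sum(1 for g in changed_gold if g == a)
    return fixed - broken

def acquire_patches(current, gold, tagset, max_patches=10):
    """Greedily collect patches that most reduce errors on the patch corpus."""
    current, patches = list(current), []
    for _ in range(max_patches):
        error_pairs = sorted({(c, g) for c, g in zip(current, gold) if c != g})
        candidates = [(a, b, name, z) for a, b in error_pairs
                      for name in sorted(TEMPLATES) for z in sorted(tagset)]
        if not candidates:
            break
        best = max(candidates, key=lambda p: net_improvement(p, current, gold))
        if not net_improvement(best, current, gold) > 0:
            break  # assumed stopping rule: nothing left that helps
        patches.append(best)
        current = apply_patch(best, current)
    return patches
```

For the verb-to-noun example in the text, the same computation gives 98 corrected errors minus 18 new errors, a net improvement of 80.</Paragraph>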
    <Paragraph position="18">  a variety of genres of written English. There are 192 tags in the tag set, 96 of which occur more than one hundred times in the corpus.</Paragraph>
    <Paragraph position="19"> 2 The test set contained text from all genres in the Brown Corpus.</Paragraph>
    <Paragraph position="20"> The first patch states that if a word is tagged TO and the following word is tagged AT, then the tag is switched from TO to IN.</Paragraph>
    <Paragraph position="22"> modal, NN = sing. noun, NP = proper noun, PPS = 3rd sing. nom. pronoun, PPO = obj. personal pronoun, TO = infinitive to, VB = verb, VBN = past part. verb, VBD = past verb.</Paragraph>
    <Paragraph position="23"> This is because a noun phrase is much more likely to immediately follow a preposition than to immediately follow infinitive TO. The second patch states that a tag should be switched from VBN to VBD if the preceding word is capitalized. This patch arises from two facts: the past verb tag is more likely than the past participle verb tag after a proper noun, and is also the more likely tag for the second word of the sentence. 4 The third patch states that VBD should be changed to VBN if any of the preceding three words is tagged HVD.</Paragraph>
    <Paragraph position="24"> Once the list of patches has been acquired, new text can be tagged as follows. First, tag the text using the basic lexical tagger. Next, apply each patch in turn to the corpus to decrease the error rate. A patch which changes the tagging of a word from a to b only applies if the word has been tagged b somewhere in the training corpus.</Paragraph>
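    <Paragraph>The tagging of new text described above might look like the following sketch. The names tag_text, next_tag_is and seen_tags are hypothetical, each patch is represented as a triple (a, b, condition), and two details are assumptions: conditions are checked against the tagging as it stood before the current patch, and a word absent from the training corpus is never retagged by a patch.

```python
def next_tag_is(z):
    """Build a condition that holds when the following word is tagged z."""
    def condition(words, tags, i):
        return tags[i + 1:i + 2] == [z]
    return condition

def tag_text(words, initial_tags, patches, seen_tags):
    """Apply each learned patch in turn to the output of the initial tagger.

    seen_tags maps a word to the set of tags it received anywhere in the
    training corpus; a patch a-to-b fires on a word only if b is in that set.
    """
    tags = list(initial_tags)
    for a, b, condition in patches:
        before = list(tags)  # assumption: test against the pre-patch tagging
        for i, word in enumerate(words):
            allowed = b in seen_tags.get(word, ())
            if before[i] == a and allowed and condition(words, before, i):
                tags[i] = b
    return tags
```

For instance, with the single patch ("TO", "IN", next_tag_is("AT")), the word to in "to the store" is retagged IN only if to was tagged IN somewhere in the training corpus.</Paragraph>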
    <Paragraph position="25"> Note that one need not be too careful when constructing the list of patch templates. Adding a bad template to the list will not worsen performance. If a template is bad, then no rules which are instantiations of that template will appear in the final list of patches learned by the tagger. This makes it easy to experiment with extensions to the tagger.</Paragraph>
  </Section>
</Paper>