<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1050">
  <Title>A Corpus-Based Approach to Deriving Lexical Mappings</Title>
  <Section position="4" start_page="0" end_page="285" type="metho">
    <SectionTitle>
2 Case Study
</SectionTitle>
    <Paragraph position="0"> To test this approach, we attempted to map between two part-of-speech tag-sets. We chose this form of linguistic annotation because it is commonly used in NLP systems and reliable taggers are readily available.</Paragraph>
    <Paragraph position="1"> The tag-sets we examine are the set used in the Penn Tree Bank (PTB) (Marcus et al., 1993) and the C5 tag-set used by the CLAWS part-of-speech tagger (Garside, 1996). The PTB set consists of 48 tags, while C5 uses a larger set of 73 tags.</Paragraph>
    <Paragraph position="2"> A portion of the British National Corpus (BNC), consisting of nearly 9 million words, was used to derive a mapping. One advantage of using the BNC is that it has already been tagged with C5 tags. The first stage was to re-tag our corpus using the Brill tagger (Brill, 1994). This produces a bi-tagged corpus in which each token has two annotations. For example, ponders/VBZ/VVZ represents the token ponders, which was assigned the tag VBZ by the Brill tagger and carries the C5 tag VVZ.</Paragraph>
    <Paragraph position="3"> The bi-tagged corpus was used to derive a pair of mappings: the word mapping and the tag mapping. To construct the word mapping from PTB to C5, we looked at each pairing of a token with a PTB tag and found the C5 tag which occurs with it most frequently. The tag mapping does not consider tokens; for example, the PTB to C5 tag mapping looks at each PTB tag in turn to find the C5 tag with which it occurs most frequently in the corpus. The C5 to PTB mappings were derived by reversing this process.</Paragraph>
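The derivation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `parse_bitagged` and `derive_mappings` are hypothetical, the toy corpus is invented for the example, and ties between equally frequent tags are broken by first occurrence, a detail the paper does not specify.

```python
from collections import Counter, defaultdict


def parse_bitagged(token):
    """Split 'word/PTB/C5' into its three parts.

    Splitting from the right keeps any '/' inside the word itself intact.
    """
    word, ptb, c5 = token.rsplit("/", 2)
    return word, ptb, c5


def derive_mappings(triples):
    """Derive PTB to C5 word and tag mappings from (word, ptb, c5) triples.

    The word mapping keys on (word, PTB tag); the tag mapping keys on the
    PTB tag alone. Each maps to the most frequent co-occurring C5 tag.
    """
    word_counts = defaultdict(Counter)
    tag_counts = defaultdict(Counter)
    for word, ptb, c5 in triples:
        word_counts[(word, ptb)][c5] += 1
        tag_counts[ptb][c5] += 1
    word_map = {key: c.most_common(1)[0][0] for key, c in word_counts.items()}
    tag_map = {key: c.most_common(1)[0][0] for key, c in tag_counts.items()}
    return word_map, tag_map


# Toy bi-tagged corpus (invented for illustration only).
corpus = [
    "ponders/VBZ/VVZ",
    "runs/VBZ/VVZ",
    "is/VBZ/VBZ",
    "the/DT/AT0",
    "dog/NN/NN1",
]
word_map, tag_map = derive_mappings(parse_bitagged(t) for t in corpus)
print(tag_map["VBZ"])            # most frequent C5 tag seen with PTB VBZ
print(word_map[("is", "VBZ")])   # word-specific choice for 'is'
```

Swapping the roles of the two tags in `derive_mappings` would give the C5 to PTB mappings in the same way.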
    <Paragraph position="4"> To test our method, we took a text tagged with one of the two tag-sets used in our experiments and translated that tagging to the other. We then compared the newly annotated text against a version with &quot;gold standard&quot; tagging. It is trivial to obtain text annotated with C5 tags using the BNC. Our evaluation of the C5 to PTB mapping operates by tagging a text using the Brill tagger, using the derived mapping to translate the annotations to C5 tags, and comparing the annotations produced with those in the BNC text.</Paragraph>
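The comparison against gold standard tags amounts to a per-token accuracy. The sketch below assumes the translated and gold tag sequences are aligned token by token; the function name `tagging_accuracy` and the sample tag lists are hypothetical and do not reproduce the paper's data.

```python
def tagging_accuracy(predicted, gold):
    """Return the fraction of positions where the two tag sequences agree."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must be aligned token by token")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Invented example: three of four translated tags match the gold tags.
translated = ["AT0", "NN1", "VVZ", "NN1"]
gold = ["AT0", "NN1", "VBZ", "NN1"]
print(tagging_accuracy(translated, gold))
```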
    <Paragraph position="5"> However, it is more difficult to obtain gold standard text for evaluating the mapping in the reverse direction, since we do not have access to a part-of-speech tagger which assigns C5 tags. That is, we cannot annotate a text with C5 tags, use our mapping to translate these to PTB tags and compare against the manual annotations from the corpus.</Paragraph>
    <Paragraph position="6"> Instead of tagging the unannotated text we use the existing C5 tags and translate those to PTB tags. Each approach to producing gold standard data has problems and advantages. The Brill tagger has a reported error rate of 3% and so cannot be expected to produce perfectly annotated text.</Paragraph>
    <Paragraph position="7"> However, when we tag the text with PTB tags and use the mapping to translate these taggings to C5 annotations we have no way to determine whether erroneous C5 tags were produced by errors in the Brill tagging or the mapping.</Paragraph>
    <Paragraph position="8"> Our test corpus was a text from the BNC consisting of 40,397 tokens. Both word and tag mappings were created in each direction (PTB to C5 and C5 to PTB). To apply the tag mapping we simply used it to convert the assigned annotation from one tag-set to the other. However, when the word mapping is applied there is the danger that a word-tag pair may not appear in the mapping and, if this is the case, the tag mapping is used as a default map.</Paragraph>
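The fallback behaviour described above can be sketched as follows. This is an illustration under stated assumptions: `translate` is a hypothetical name, the two small dictionaries are invented stand-ins for mappings derived from a corpus, and the sketch raises a `KeyError` for a tag missing from the tag mapping, a case the paper does not discuss.

```python
def translate(tagged_tokens, word_map, tag_map):
    """Translate (word, source tag) pairs to the target tag-set.

    The word mapping is tried first; if the (word, tag) pair was never
    seen, the tag mapping is used as the default.
    """
    translated = []
    for word, tag in tagged_tokens:
        new_tag = word_map.get((word, tag))
        if new_tag is None:
            # Unseen word-tag pair: fall back to the tag mapping.
            # A tag absent from tag_map raises KeyError here.
            new_tag = tag_map[tag]
        translated.append((word, new_tag))
    return translated


# Invented stand-ins for derived PTB to C5 mappings.
word_map = {("is", "VBZ"): "VBZ", ("the", "DT"): "AT0"}
tag_map = {"VBZ": "VVZ", "DT": "AT0", "NN": "NN1"}

result = translate(
    [("the", "DT"), ("dog", "NN"), ("is", "VBZ"), ("ponders", "VBZ")],
    word_map,
    tag_map,
)
print(result)
```

Here 'is' is resolved by the word mapping, while 'dog' and 'ponders' are not in the word mapping and so receive the default tag for their PTB tag.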
    <Paragraph position="9"> The results from our evaluation are shown in Table 1. We can see that the C5 to PTB word mapping produces impressive results which are close to the theoretical upper bound of 97% for the task. In addition, the word mapping in the opposite direction is correct for 95% of tokens.</Paragraph>
    <Paragraph position="10"> Although the results for the word mappings in each direction are quite similar, there is a significant difference in the performance of the default mappings: 86% and 74%. Analysis suggests that the PTB to C5 default mapping is less successful than the one which operates in the opposite direction because it attempts to reproduce the tags in a fine-grained set from a more general one.</Paragraph>
  </Section>
</Paper>