<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2114">
  <Title>Linguistic Indeterminacy as a Source of Errors in Tagging</Title>
  <Section position="6" start_page="678" end_page="679" type="concl">
    <SectionTitle>
6 Tagging Undecidable Situations
</SectionTitle>
    <Paragraph position="0"> How are bidirectional error patterns like the one above to be treated? Looking at their close context, it is often impossible to handle the situation with some smart tagging restriction or other device. They are also so equal in number and so frequent that one cannot simply decide to let one reading overrule the other and live with the errors that such a happy-go-lucky solution would give rise to. (As a practicing corpus tagger, I know that this unorthodox method can sometimes be the best way out of problematic situations.) Another possibility would be to amalgamate the two readings into one, bivalued or underspecified, depending on how one chooses to see it. As already mentioned, these more or less undecidable bidirectional patterns have been observed and discussed by others working with the tagging of large corpora, and they have, seemingly independently of each other, come up with similar suggestions. Below are three quotations dealing with this matter.</Paragraph>
    <Paragraph position="1"> The Penn Treebank: 'However, even given explicit criteria for assigning POS tags to potentially ambiguous words, it is not always possible to assign a unique tag to a word with confidence. Since a major concern of the Treebank is to avoid requiring annotators to make arbitrary decisions, we allow words to be associated with more than one POS tag.</Paragraph>
    <Paragraph position="2"> Such multiple tagging indicates either that the word's part of speech simply cannot be decided or that the annotator is unsure which of the alternative tags is the correct one.' (Marcus et al. 1993, 316.) The British National Corpus: 'In order to provide more useful results in a substantial proportion of the residual words which cannot be successfully tagged, we have introduced portmanteau tags. A portmanteau tag is used in a situation where there is insufficient evidence for Claws to make a clear distinction between two tags. Thus, in the notoriously difficult choice between a past participle and the past tense of a verb, if there is insufficient probabilistic evidence to choose between the two Claws marks the word as VVN-VVD. A set of fifteen such portmanteau tags have been declared, covering the major pairs of confusable tags.' (Garside 1995.) Constraint Grammar: 'In the rare cases where two analyses were regarded as equally legitimate, both could be marked.' (Voutilainen and Järvinen 1995, 212.) It is, however, important that the situations where underspecified tags can be used are restricted to well-defined cases and that the reasons for using them are quite clear. They should have what I call a 'mirror' character, in that the interchange goes in both directions, and they should concern clearly distinct pairs of tags even when a word has several other tags as well. Such situations are more common in automatic tagging but they occur in manual tagging as well.</Paragraph>
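The portmanteau mechanism described in the British National Corpus quotation can be illustrated with a minimal sketch. It is not the Claws implementation: the function name choose_tag, the margin value 0.1, and the idea of comparing the two highest tag probabilities are assumptions made here for illustration. Only the pair VVN-VVD is taken from the quotation.

```python
# Declared portmanteau pairs. Only VVN-VVD is named in the quotation;
# a real tagset would list its other declared pairs here.
PORTMANTEAU_PAIRS = {frozenset({"VVN", "VVD"}): "VVN-VVD"}


def choose_tag(tag_probs, margin=0.1):
    """Pick one tag, or a portmanteau tag when the evidence is too weak.

    tag_probs maps candidate tags to probabilities for one word token.
    margin is a hypothetical threshold: if the two best candidates form a
    declared pair and differ by no more than margin, the pair is kept
    underspecified instead of forcing a choice.
    """
    if not tag_probs:
        raise ValueError("at least one candidate tag is required")
    ranked = sorted(tag_probs.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best, p_best), (second, p_second) = ranked[0], ranked[1]
    pair = frozenset({best, second})
    if pair in PORTMANTEAU_PAIRS and margin >= p_best - p_second:
        return PORTMANTEAU_PAIRS[pair]
    return best


print(choose_tag({"VVN": 0.52, "VVD": 0.48}))  # close call within a declared pair
print(choose_tag({"VVN": 0.90, "VVD": 0.10}))  # clear evidence for one reading
```

Restricting the fallback to declared pairs mirrors the requirement that underspecified tags be limited to well-defined cases: two close candidates that do not form a declared pair still receive a single tag.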
    <Paragraph position="3"> The reasons for a situation being undecidable can, however, vary. Voutilainen and Järvinen, in their study of inter-annotator agreement, mention three situations where an underdetermined analysis was accepted: 'When the judges disagree about the correct analysis even after negotiations. In this case, comments were added to distinguish it from the other two types. Neutralisation: both analyses were regarded as equivalent. (This often indicates a redundancy in the lexicon.) Global ambiguity: the sentence was agreed to be globally ambiguous.' (Voutilainen and Järvinen 1995, 212.) Marcus et al. (1993) allow underspecified tagging both for annotators' uncertainty or disagreement and for cases that correspond to Voutilainen and Järvinen's neutralisation and global ambiguity. This may be infelicitous. It is important to keep a clear borderline between situations that could be solved in principle and those that are truly undecidable. The latter may lead us to questions about the nature of language and to what extent natural language really is exact and well-defined.</Paragraph>
    <Paragraph position="4"> Introducing underspecified tags would influence the training and performance of a probabilistic tagger in at least the following ways: a) The concerned words would mostly get more alternative tags, one for each of the unambiguous readings plus one for the underspecified one. According to common tagging principles, this would be a disadvantage. b) There would be fewer observations of each of the alternative tags, as the competing unambiguous tags both would lose some of their instances to their common underspecified alternative. This would also be a disadvantage. c) The observations of each tag would hopefully be more correct, as the instances 'lost' to the underspecified tag would be the tricky and atypical cases that otherwise might obscure the contextual patterns of the unambiguous tags. d) The underspecified instances can later be automatically retrieved for either manual inspection or some more elaborate disambiguation device.</Paragraph>
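Effects a), b) and d) can be made concrete with a small sketch. The corpus below is invented toy data, not taken from any of the corpora discussed, and the names CORPUS, retag and tags_per_word are hypothetical. Each token carries a flag saying whether an annotator judged the choice between VVN and VVD undecidable.

```python
from collections import Counter, defaultdict

# Invented toy data: (word, tag, undecidable) triples.
CORPUS = [
    ("walked", "VVD", False),
    ("walked", "VVD", False),
    ("walked", "VVN", False),
    ("walked", "VVN", True),
    ("walked", "VVD", True),
    ("house", "NN1", False),
]


def retag(corpus, pair=("VVN", "VVD")):
    """Replace flagged tokens of the given pair by one underspecified tag."""
    merged = "-".join(pair)
    return [(w, merged if (u and t in pair) else t) for w, t, u in corpus]


def tags_per_word(tagged):
    """Collect the alternative tags observed for each word form."""
    lexicon = defaultdict(set)
    for word, tag in tagged:
        lexicon[word].add(tag)
    return {word: sorted(tags) for word, tags in lexicon.items()}


original = [(w, t) for w, t, _ in CORPUS]
underspecified = retag(CORPUS)

# a) the word gets one more alternative tag
print(tags_per_word(original)["walked"])
print(tags_per_word(underspecified)["walked"])

# b) each unambiguous tag loses observations to the underspecified tag
print(Counter(t for _, t in original))
print(Counter(t for _, t in underspecified))

# d) the underspecified instances can be retrieved by position
flagged = [i for i, (_, t) in enumerate(underspecified) if t == "VVN-VVD"]
print(flagged)
```

Effect c) cannot be shown with counts alone, since it concerns whether the remaining observations give cleaner contextual patterns; that would have to be measured by training and evaluating a tagger on both versions of the corpus.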
    <Paragraph position="5"> It is still an open question whether the more clear-cut distinctions introduced by the underspecified tags compensate for the accompanying disadvantages, but at least they have the intellectually pleasing property of showing where there are truly ambiguous situations in language. By systematic modifications of the tagset along these lines it is possible to determine to what extent the introduction of underspecified tags will improve the overall performance of a tagger and/or facilitate the task of human annotators.</Paragraph>
  </Section>
</Paper>