<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1008">
  <Title>Tagging accurately - Don't guess if you know</Title>
  <Section position="7" start_page="50" end_page="51" type="concl">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> In this paper we have demonstrated how knowledge-based and statistical techniques can be combined to improve the accuracy of a part-of-speech tagger. Our system reaches an accuracy of better than 98 % using a relatively fine-grained grammatical representation.</Paragraph>
    <Paragraph position="1"> Some concluding remarks are in order.</Paragraph>
    <Paragraph position="2"> (Footnote 3) Even without the mapping errors, the reported 4 % error rate of XT is considerably higher than that of our hybrid.</Paragraph>
    <Paragraph position="3"> * Using linguistic information before a statistical module provides a better result than using a statistical module alone.</Paragraph>
    <Paragraph position="4"> * ENGCG leaves some 'hard' ambiguities unresolved (about 3-7 % of all words). This amount is characteristic of the ENGCG rule formalism, tagset and disambiguation grammar. It does not necessarily hold for other knowledge-based systems.</Paragraph>
    <Paragraph position="5"> * Only about 20-25 % of the errors made by the statistical component occur in the analysis of these 'hard' ambiguities. That is, 75-80 % of the errors made by the statistical tagger were resolved correctly using linguistic rules.</Paragraph>
    <Paragraph position="6"> * Certain kinds of ambiguity left pending by ENGCG, e.g. CS vs. PREP, are resolved rather unreliably by XT.</Paragraph>
    <Paragraph position="7"> * The overall result is better than that of other state-of-the-art part-of-speech disambiguators. In our 27000 word test sample from a previously unseen corpus, 98.5 % of words received a correct analysis. In other words, the error rate is reduced at least by half.</Paragraph>
    <Paragraph position="8"> Although the result is better than that provided by any other tagger that produces fully disambiguated output, we believe that the result could still be improved. Some possibilities: * We could use partly disambiguated text (e.g. the output of parsers D1, D2 or D3) and disambiguate the result using a knowledge-based syntactic parser (see experiments in (Voutilainen and Tapanainen, 1993)).</Paragraph>
    <Paragraph position="9"> * We could leave the text partly disambiguated, and use a syntactic parser that uses both linguistic knowledge and corpus-based heuristics (see (Tapanainen and Järvinen, 1994)).</Paragraph>
    <Paragraph position="10"> * Some ambiguities are very difficult to resolve within the small window that statistical taggers currently use (e.g. CS vs. PREP ambiguity when a noun phrase follows). A better way to resolve them would probably be to write (heuristic) rules.</Paragraph>
    <Paragraph position="11"> * We could train the statistical tagger on the output of a knowledge-based tagger. That is problematic because statistical methods generally seem to require a compact set of tags, while a knowledge-based system needs more informative tags. The tag set of a knowledge-based system would have to be reduced to some subset.</Paragraph>
    <Paragraph position="12"> That might prevent some mapping errors but there is no guarantee that the statistical tagger would work any better.</Paragraph>
    <Paragraph position="13"> * We could try the components in a different order: using statistics before heuristic knowledge, etc. However, currently the heuristic component makes fewer errors than the statistical tagger.</Paragraph>
    <Paragraph position="14">  the accuracy of XT is almost the same as the accuracy of any other statistical tagger. What is more, the accuracy of purely statistical taggers has not greatly increased since the first of its kind, CLAWS1 (Marshall, 1983), was published over ten years ago.</Paragraph>
    <Paragraph position="15"> We believe that the best way to boost the accuracy of a tagger is to employ even more linguistic knowledge. The knowledge should, in addition, contain more syntactic information so that we could refer to real (syntactic) objects of the language, not just a sequence of words or parts of speech. Statistical information should be used only when one does not know how to resolve the remaining ambiguity, and there is a definite need to get fully unambiguous output.</Paragraph>
  </Section>
</Paper>