<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1045">
<Title>EXAMPLE-BASED CORRECTION OF WORD SEGMENTATION AND PART OF SPEECH LABELLING</Title>
<Section position="3" start_page="0" end_page="227" type="intro">
<SectionTitle>1. INTRODUCTION</SectionTitle>
<Paragraph position="0">Probabilistic part-of-speech taggers have proven successful in English part-of-speech labelling [Church, 1988; DeRose, 1988; de Marcken, 1990; Meteer et al., 1991, etc.].</Paragraph>
<Paragraph position="1">Such stochastic models perform very well given adequate amounts of training data representative of operational data. Instead of merely stating what is possible, as a non-stochastic rule-based model does, probabilistic models predict the likelihood of an event.</Paragraph>
<Paragraph position="2">They have proven quite effective for English, both in determining the part of speech of a highly ambiguous word in context and in determining the part of speech of an unknown word.</Paragraph>
<Paragraph position="3">By contrast, rule-based morphological analyzers employing a hand-crafted lexicon and a hand-crafted connectivity matrix are the traditional approach to Japanese word segmentation and part-of-speech labelling [Aizawa and Ebara, 1973]. Such algorithms have already achieved 90-95% accuracy in word segmentation and 90-95% accuracy in part-of-speech labelling (given correct word segmentation). The potential advantage of a rule-based approach is that a human can write rules covering events that are rare, and therefore may be inadequately represented in most training sets.</Paragraph>
<Paragraph position="4">Furthermore, it is commonly assumed that large training sets are not required.</Paragraph>
<Paragraph position="5">A third approach combines a rule-based part-of-speech tagger with a set of correction templates automatically derived from a training corpus [Brill, 1992].</Paragraph>
<Paragraph position="6">We faced the challenge of processing Japanese text, where neither spaces nor any other delimiters mark the beginning and end of words. We had at our disposal the following: - A rule-based Japanese morphological processor (JUMAN) from Kyoto University.</Paragraph>
<Paragraph position="7">- A context-free grammar of Japanese based on part-of-speech labels distinct from those produced by JUMAN.</Paragraph>
<Paragraph position="8">- A probabilistic part-of-speech tagger (POST) [Meteer et al., 1991] which assumed a single sequence of words as input.</Paragraph>
<Paragraph position="9">- Limited human resources for creating training data. This presented us with four issues: 1) how to reduce the cost of modifying the rule-based morphological analyzer to produce the parts of speech needed by the grammar; 2) how to apply probabilistic modeling to Japanese, e.g., to improve accuracy to ~97%, which is typical of results in English; 3) how to deal with unknown words, for which JUMAN typically makes no prediction regarding part of speech; and 4) how to estimate probabilities for low-frequency phenomena.</Paragraph>
<Paragraph position="10">Here we report on an example-based technique for correcting systematic errors in word segmentation and part-of-speech labelling in Japanese text. Rather than using hand-crafted rules, the algorithm employs example data, drawing generalizations during training. In motivation, it is similar to one of the goals of Brill (1992).</Paragraph>
</Section>
</Paper>