<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0305"> <Title>HMM-based Part-of-Speech Tagging for Chinese Corpora</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Part-of-speech tagged corpora are very useful for natural language processing (NLP) applications such as speech recognition, text-to-speech, information retrieval, and machine translation systems. Automatic part-of-speech tagging has been intensively studied and practiced for European languages [1-4, 7, 8, 10].</Paragraph> <Paragraph position="1"> However, the technology of automatic Chinese part-of-speech tagging is still in its infancy, for the following reasons: 1. The definition of a word in Chinese is not clear; there are no breaks between adjacent words. For example, the string ~--~tg contains four characters, but different linguists may divide it into one, two, three, or four words. Other difficult cases include compound words (e.g., ~1~), split words (e.g., ~), acronyms (e.g., ~), and literary words.</Paragraph> <Paragraph position="2"> 2. Word segmentation cannot be fully automated. 3. A well-defined tag set for Chinese parts of speech is not available.</Paragraph> <Paragraph position="3"> 4. A Chinese lexicon with complete parts of speech is hard to find.</Paragraph> <Paragraph position="4"> 5. Chinese part-of-speech tagging is difficult even for humans; that is, the parts of speech of many words are either arguable or difficult to decide. 6. Manually tagged Chinese corpora, counterparts of the Brown corpus and the LOB corpus, are not available.</Paragraph> <Paragraph position="5"> These intertwined problems make Chinese part-of-speech tagging an especially difficult task. Lee and Chang Chien [5, 6] used a Tri-POS Markov language model and a bootstrap training process to tag a small Chinese corpus (1714 sentences for training and 233 sentences for testing). 
They reported a tagging accuracy of 81.13% for all words and 87.60% for known words.</Paragraph> <Paragraph position="6"> In this paper, we present our work on part-of-speech tagging of a large Chinese corpus based on a hidden Markov model (HMM). This is among the first reports on automatic Chinese part-of-speech tagging in the literature [5, 6].</Paragraph> </Section> </Paper>