<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0305">
  <Title>HMM-based Part-of-Speech Tagging for Chinese Corpora</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Part-of-speech tagged corpora are very useful for natural language processing (NLP) applications such as speech recognition, text-to-speech, information retrieval, and machine translation systems. Automatic part-of-speech tagging has been intensively studied and practiced for European languages [1-4, 7, 8, 10].</Paragraph>
    <Paragraph position="1"> However, the technology of automatic Chinese part-of-speech tagging is still in its infancy, for the following reasons: 1. The definition of a word in Chinese is not clear; there are no breaks between adjacent words. For example, the string ~--~tg contains four characters, but it can be divided into one, two, three, or four words by different linguists. Other difficult cases include compound words (e.g., ~1~), split words (e.g., ~), acronyms (e.g., ~), and literary words.</Paragraph>
    <Paragraph position="2"> 2. Word segmentation cannot be fully automated. 3. A well-defined tag set for Chinese parts of speech is not available.</Paragraph>
    <Paragraph position="3"> 4. A Chinese lexicon with complete parts-of-speech is hard to find.</Paragraph>
    <Paragraph position="4"> 5. Chinese part-of-speech tagging is difficult even for humans, i.e., the parts of speech of many words are either arguable or difficult to decide. 6. Manually tagged Chinese corpora, counterparts of the Brown corpus and the LOB corpus, are not available.</Paragraph>
    <Paragraph position="5"> These intertwined problems make Chinese part-of-speech tagging an especially difficult task. Lee and Chang Chien [5, 6] used a Tri-POS Markov language model and a bootstrap training process to tag a small Chinese corpus (1714 sentences for training and 233 sentences for testing). They reported a tagging accuracy of 81.13% for all words and 87.60% for known words.</Paragraph>
    <Paragraph position="6"> In this paper, we present our work on part-of-speech tagging of a large Chinese corpus based on a hidden Markov model (HMM). This is among the first reports on automatic Chinese part-of-speech tagging in the literature [5, 6].</Paragraph>
  </Section>
</Paper>