<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3033">
  <Title>Towards a Hybrid Model for Chinese Word Segmentation</Title>
  <Section position="2" start_page="0" end_page="189" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper describes a hybrid Chinese word segmenter that participated in the closed track of the Peking University Corpus in the Second International Chinese Word Segmentation Bakeoff. The segmenter is still at an early stage of development and is being developed as part of a larger Chinese unknown word resolution system that performs the identification, part-of-speech guessing, and sense guessing of Chinese unknown words (Lu, 2005).</Paragraph>
    <Paragraph position="1"> The segmenter consists of two major components. First, a tagging component tags each individual character in a sentence with a position-of-character (POC) tag that indicates the position of the character in a word. The tag takes one of four values: the character either forms a monosyllabic word by itself or occupies a word-initial, word-middle, or word-final position. This component is based on the transformation-based learning (TBL) algorithm (Brill, 1995), where a simple first-order HMM tagger (Charniak et al., 1993) is used to produce an initial tagging of a character sequence. Second, a merging component transforms the output of the tagging component, i.e., a POC-tagged character sequence, into a word-segmented sentence. Although this process relies largely on the POC tags assigned to the individual characters, it also uses a number of heuristics generalized from the training data to handle non-Chinese characters, numeric-type compounds, and long words.</Paragraph>
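The relationship between word boundaries and POC tags described above can be illustrated with a minimal Python sketch. It is not the system's implementation: the tag labels S, I, M, and F and the function names words_to_poc and merge_poc are hypothetical, the TBL and HMM tagging step is not modeled, and the merging step uses the tags alone, without the heuristics for non-Chinese characters, numeric-type compounds, and long words.

```python
# Hypothetical labels for the four positions named in the text:
# monosyllabic word, word-initial, word-middle, word-final.
SINGLE, INITIAL, MIDDLE, FINAL = "S", "I", "M", "F"


def words_to_poc(words):
    """Turn a segmented sentence into (character, POC tag) pairs."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, SINGLE))
            continue
        tagged.append((word[0], INITIAL))
        for ch in word[1:-1]:
            tagged.append((ch, MIDDLE))
        tagged.append((word[-1], FINAL))
    return tagged


def merge_poc(tagged):
    """Merge a POC-tagged character sequence into words, using tags only."""
    words = []
    current = ""
    for ch, tag in tagged:
        # A tag that opens a word closes any word left unfinished
        # by an inconsistent tag sequence.
        if tag in (SINGLE, INITIAL) and current:
            words.append(current)
            current = ""
        current += ch
        # A tag that ends a word flushes the characters collected so far.
        if tag in (SINGLE, FINAL):
            words.append(current)
            current = ""
    if current:
        # Sequence ended without a word-final tag.
        words.append(current)
    return words


sentence = ["我们", "在", "北京大学"]
tagged = words_to_poc(sentence)
# Merging the tags derived from a segmentation recovers that segmentation.
assert merge_poc(tagged) == sentence
```

In this sketch, words_to_poc shows how training labels could be derived from a segmented corpus, and merge_poc shows why a tagged sequence determines a segmentation. The flush on an unexpected word-opening tag is one assumed way to handle inconsistent tagger output; the source does not say how the system treats such cases.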
    <Paragraph position="2"> The approach adopted here follows the line of research that uses character-based tagging for Chinese word segmentation and/or unknown word identification (Goh et al., 2003; Xue, 2003; Zhang et al., 2002). Character-based tagging allows us to model the tendency of individual characters to combine with other characters to form words in different contexts. This property gives the model good potential for improving the performance of Chinese unknown word identification, a major concern of the Chinese unknown word resolution system of which the segmenter is a part.</Paragraph>
    <Paragraph position="3"> The rest of the paper is organized as follows.</Paragraph>
    <Paragraph position="4"> Section two describes the system architecture.</Paragraph>
    <Paragraph position="5"> Section three reports the results of the system in the bakeoff. Section four concludes the paper.</Paragraph>
  </Section>
</Paper>