<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0120">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. On Closed Task of Chinese Word Segmentation: An Improved CRF Model Coupled with Character Clustering and Automatically Generated Template Matching</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Unlike Western languages, Chinese does not have explicit word delimiters. Chinese word segmentation (CWS) is therefore essential for Chinese text processing or indexing. There are two main problems in the closed CWS task. The first is to identify and segment non-Chinese word sequences in Chinese documents, especially in a closed task (described later). A good CWS system should be able to handle Chinese texts peppered with non-Chinese words or phrases. Since the morphologies of non-Chinese languages are quite different from that of Chinese, our approach must depend on how many non-Chinese words appear, whether they are connected with each other, and whether they are interleaved with Chinese words. If we can distinguish non-Chinese characters automatically and apply different strategies to them, segmentation performance can be improved. The second problem in closed CWS is to correctly identify longer named entities (NEs). Most machine learning (ML)-based CWS systems use a five-character context window to determine the current character's tag. In the majority of cases, given the constraints of computational resources, this compromise is acceptable. However, limited by the window size, these systems often handle long words poorly.</Paragraph>
    <Paragraph position="1"> In this paper, our goal is to construct a general CWS system that can deal with both of the above problems. We adopt a conditional random field (CRF) as our ML model.</Paragraph>
  </Section>
</Paper>