File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/06/w06-0120_abstr.xml
Size: 1,377 bytes
Last Modified: 2025-10-06 13:45:16
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0120"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics On Closed Task of Chinese Word Segmentation: An Improved CRF Model Coupled with Character Clustering and Automatically Generated Template Matching</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper addresses two major problems in closed task of Chinese word segmentation (CWS): tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. To resolve the former, we apply K-means clustering to identify non-Chinese characters, and then adopt a two-tagger architecture: one for Chinese text and the other for non-Chinese text. For the latter problem, we apply postprocessing to our CWS output using automatically generated templates. The experiment results show that, when non-Chinese characters are sparse in the training corpus, our two-tagger method significantly improves the segmentation of sentences containing non-Chinese words. Identification of long NEs and long words is also enhanced by template-based postprocessing. In the closed task of SIGHAN 2006 CWS, our system achieved F-scores of 0.957, 0.972, and 0.955 on the CKIP, CTU, and MSR corpora respectively.</Paragraph> </Section> class="xml-element"></Paper>