<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0118"> <Title>Voting between Dictionary-based and Subword Tagging Models for Chinese Word Segmentation</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Tokenizing input text into words is the first step of any text analysis task. In Chinese, a sentence is written as a string of characters, to which we shall refer by their traditional name of hanzi, without separations between words. As a result, before any text analysis of Chinese can proceed, the word segmentation task has to be completed so that each word is &quot;isolated&quot; by the word-boundary information.</Paragraph> <Paragraph position="1"> Our system participated in the third SIGHAN Chinese Word Segmentation Bakeoff in 2006, where it was tested on the closed track of the CityU, MSRA and UPUC corpora. The sections below provide a detailed description of the system and our experimental results.</Paragraph> </Section> </Paper>