<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0118">
  <Title>Voting between Dictionary-based and Subword Tagging Models for Chinese Word Segmentation</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Tokenizing input text into words is the first step of any text analysis task. In Chinese, a sentence is written as a string of characters, to which we shall refer by their traditional name of hanzi, without separations between words. As a result, before any text analysis of Chinese can begin, the word segmentation task has to be completed so that each word is &quot;isolated&quot; by the word-boundary information.</Paragraph>
    <Paragraph position="1"> Our system participated in the third SIGHAN Chinese Word Segmentation Bakeoff in 2006, where it was tested on the closed track of the CityU, MSRA and UPUC corpora. The sections below provide a detailed description of the system and our experimental results.</Paragraph>
  </Section>
</Paper>