<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1030">
  <Title>IMPROVING CHINESE TOKENIZATION WITH LINGUISTIC FILTERS ON STATISTICAL LEXICAL ACQUISITION</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> The first step in Chinese NLP is to tokenize or segment character sequences into words, since the text contains no word delimiters. Recent heavy activity in this area has shown the biggest stumbling block to be words that are absent from the lexicon, since successful tokenizers to date have been based on dictionary lookup (e.g., Chang &amp;Chen 1993; Chiang et al. 1992; Linet al. 1993; Wu &amp; Tseng 1993; Sproat et al. 1994).</Paragraph>
    <Paragraph position="1"> We present empirical evidence for four points concerning tokenization of Chinese text: (I) More rigorous &amp;quot;blind&amp;quot; evaluation methodology is needed to avoid inflated accuracy measurements; we introduce the nk-blind method. (2) The extent of the unknown-word problem is far more serious than generally thought, when tokenizing unrestricted texts in realistic domains. (3) Statistical lexical acquisition is a practical means to greatly improve tokenization accuracy with unknown words, reducing error rates as much as 32.0%. (4) When augmenting the lexicon, linguistic constraints can provide simple inexpensive filters yielding significantly better precision, reducing error rates as much as 49.4%.</Paragraph>
  </Section>
class="xml-element"></Paper>