<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2200">
  <Title>CHINESE STRING SEARCHING USING THE KMP ALGORITHM</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> The alphabet size of Chinese (more precisely, Hanyu) is relatively large (e.g. about 55,000 characters in Hanyu Da Cidian) compared with that of Indo-European languages.</Paragraph>
    <Paragraph position="1"> Various internal codes (e.g. GB, Big5, and Unicode) have been designed to represent a selected subset (5,000-16,000 characters), each character of which requires two or more bytes. For compatibility with existing single-byte text, the most significant bit of the first byte is used to distinguish between multi-byte characters and single-byte characters. For instance, Web browsers (e.g.</Paragraph>
    <Paragraph position="2"> Netscape) cannot interpret the annotations represented by their equivalent 2-byte characters. Thus, Chinese string searching algorithms have to deal with a mixture of single- and multi-byte characters.</Paragraph>
    <Paragraph position="3"> This paper will focus on 2-byte characters because their internal codes are widely used. Two modified versions of the KMP algorithm are presented: the classical one and the finite-automaton implementation.</Paragraph>
    <Paragraph position="4"> Finally, we discuss the practical situations in Chinese string searching.</Paragraph>
  </Section>
</Paper>