<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2200"> <Title>CHINESE STRING SEARCHING USING THE KMP ALGORITHM</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> The alphabet of Chinese (more precisely, Hanyu) is large compared with those of Indo-European languages (e.g. about 55,000 characters in the Hanyu Da Cidian).</Paragraph> <Paragraph position="1"> Various internal codes (e.g. GB, Big5, and Unicode) have been designed to represent a selected subset of 5,000-16,000 characters, each of which requires two or more bytes. For compatibility with existing single-byte text, the most significant bit of the first byte is used to distinguish between multi-byte characters and single-byte characters.</Paragraph> <Paragraph position="2"> For instance, Web browsers (e.g. Netscape) cannot interpret annotations when they are written with the equivalent 2-byte characters instead of single-byte ones. Thus, Chinese string searching algorithms have to deal with a mixture of single- and multi-byte characters.</Paragraph> <Paragraph position="3"> This paper focuses on 2-byte characters because their internal codes are widely used. Two modified versions of the KMP algorithm are presented: the classical one and the finite-automaton implementation.</Paragraph> <Paragraph position="4"> Finally, we discuss practical situations that arise in Chinese string searching.</Paragraph> </Section> </Paper>
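<!--
Illustrative sketch (not the paper's modified KMP algorithms): the introduction notes that the most significant bit of the first byte separates 2-byte characters from single-byte ones, and that searching must cope with a mixture of both. One simple way to see why this matters is to split the byte stream into characters first and then run the classical KMP algorithm over characters, so that a match can never begin in the middle of a 2-byte character. The function names split_chars, failure_table and kmp_search are hypothetical, and the rule "high bit set on the first byte means a 2-byte character" is an assumption that holds for GB- and Big5-style codes but not for every encoding.

```python
def split_chars(data: bytes) -> list:
    """Split a byte string into 1-byte and 2-byte characters.

    Assumption: a byte whose most significant bit is set starts a
    2-byte character; any other byte is a single-byte character.
    """
    chars = []
    i = 0
    while i < len(data):
        width = 2 if data[i] & 0x80 else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def failure_table(pattern: list) -> list:
    """Classical KMP failure function, computed over characters."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: bytes, pattern: bytes) -> list:
    """Return byte offsets where pattern occurs on a character boundary."""
    t, p = split_chars(text), split_chars(pattern)
    if not p:
        return []
    # Byte offset at which each character of the text starts.
    starts, pos = [], 0
    for ch in t:
        starts.append(pos)
        pos += len(ch)
    fail = failure_table(p)
    offsets, k = [], 0
    for i, ch in enumerate(t):
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            offsets.append(starts[i - len(p) + 1])
            k = fail[k - 1]  # keep going to find overlapping matches
    return offsets


# Two single-byte letters, two 2-byte characters, two single-byte letters.
text = b"ab\xb0\xa1\xa1\xb0cd"
print(kmp_search(text, b"\xa1\xb0"))  # second 2-byte character
print(text.find(b"\xa1\xa1"))         # naive byte search straddles two characters
print(kmp_search(text, b"\xa1\xa1"))  # character-aware search rejects it
```

In this sketch the naive byte-level search reports a hit that spans the second byte of one character and the first byte of the next, while the character-aware search does not. Splitting the text up front costs an extra pass and extra memory, which a KMP variant working directly on bytes could avoid.
-->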