<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2200"> <Title>CHINESE STRING SEARCHING USING THE KMP ALGORITHM</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. The Problem </SectionTitle> <Paragraph position="0"> Directly using existing fast string searching algorithms (Knuth et al., 1977; Boyer and Moore, 1977) on on-line Chinese text can lead to identification errors, as happens with the find option of Netscape in a Chinese window. For example, the pattern string, P (i.e. AA,AA in hexadecimal), can successfully match the second and third bytes of the text string, T (i.e. A4,AA,AA,43 in hexadecimal), which is incorrect. The error occurs because the second byte of a character in T is interpreted as the first byte of the pattern character. Thus, it is necessary to decode the input data as characters.</Paragraph> <Paragraph position="1"> Two well-known string searching algorithms were discovered by Knuth, Morris and Pratt (1977) (KMP), and Boyer and Moore (1977) (BM). The KMP algorithm has better worst-case time complexity whereas the BM algorithm has better average-case time complexity. Recently, there has been some interest in improving the time complexity of the BM algorithm (Hume and Sunday, 1991; Crochemore et al., 1994) or proving a smaller bound on it (Cole, 1994), as well as in the efficient construction of the BM algorithm (Baeza-Yates et al., 1994). These algorithms derived from BM assume that, knowing the positional index, i, of the text string, T, one can access and interpret the data, T[i], as a character. However, with a text string of single- and multi-byte characters, i can point to the first byte or the second byte of a 2-byte character, and the computer cannot determine which in the middle of the text string. It has to scan left or right until a one-byte character, the beginning of the text string or the end of the text string is encountered. 
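The alignment problem described above can be illustrated with a minimal sketch. It is not the method proposed in this paper; it only assumes, as the sign test used later for Figure 1 suggests, that a byte value of 0x80 or more starts a 2-byte character. The function names are hypothetical and chosen for this illustration only.

```python
def is_lead_byte(b: int) -> bool:
    # Assumption of this sketch: a byte of 0x80 or more (negative as a
    # signed char) starts a 2-byte character; anything else is one byte.
    return b >= 0x80


def split_characters(data: bytes) -> list:
    """Decode bytes left to right into (byte offset, character bytes) pairs."""
    chars = []
    i = 0
    while len(data) > i:
        width = 2 if is_lead_byte(data[i]) and len(data) > i + 1 else 1
        chars.append((i, data[i:i + width]))
        i += width
    return chars


def naive_byte_search(text: bytes, pattern: bytes) -> int:
    """Plain byte-level search; a hit may straddle two characters."""
    return text.find(pattern)


def character_search(text: bytes, pattern: bytes) -> int:
    """Accept a byte-level hit only if it starts on a character boundary."""
    starts = {offset for offset, _ in split_characters(text)}
    pos = text.find(pattern)
    while pos != -1:
        if pos in starts:
            return pos
        pos = text.find(pattern, pos + 1)
    return -1


text = bytes([0xA4, 0xAA, 0xAA, 0x43])   # two 2-byte characters: A4AA and AA43
pattern = bytes([0xAA, 0xAA])            # one 2-byte character: AAAA
print(naive_byte_search(text, pattern))  # 1: false match on bytes two and three
print(character_search(text, pattern))   # -1: no character-aligned match
```

The boundary set is obtained by decoding from the start of the text, which is the left-to-right scan that a byte index alone cannot replace.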
For example, the BM algorithm moves to position i = 4 (= ||P||) for matching in Table 1. At this position, T[4] (= A4) does not match P[4]. Since the computer cannot determine whether T[4] is the first or second byte of a 2-byte character, it cannot use the delta tables to determine the next matching state. Even worse, for some internal codes (e.g. Big-5), it is not possible to directly convert the byte sequence into the corresponding character sequence in the backward direction. Thus, as a first step, we focus on modifying the KMP algorithm for Chinese string searching.</Paragraph> <Paragraph position="3"> Table 1: Matching between the text string, T, and the pattern string, P. Here, T[] and P[] show the hexadecimal value of each byte in T and P.</Paragraph> </Section> <Section position="5" start_page="0" end_page="1113" type="metho"> <SectionTitle> 3. Knuth-Morris-Pratt Algorithm. </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Searching </SectionTitle> <Paragraph position="0"> Figure 1 is the listing of the modified version of the KMP algorithm (Knuth et al., 1977) for searching a Chinese string. Here, i is the positional index of the text string, with the position specified in terms of bytes. By comparison, j is the positional index of the pattern string, P, with the position in terms of characters. Characters in P are stored in two arrays, P1[] and P2[]. Here, P1[] stores the first byte and P2[] stores the second byte of two-byte characters in P. If there are single-byte characters in P, they are stored in P1[] and the data in the corresponding positions of P2[] are undefined. Here, we assume that a NULL character is patched at those positions. 
For example, for the pattern string P given in Table 2, the values in P1[] and P2[] are shown there.</Paragraph> <Paragraph position="1"> whether the current input is a single- or 2-byte character, by testing whether the converted integer value of T[i] is positive or negative. If the converted value is negative, then T[i] is the first byte of a 2-byte character. Here, |T| and ||T|| are the lengths of the text string, T, in terms of characters and bytes, respectively.</Paragraph> <Paragraph position="2"> The program in Figure 1 determines (in line 6) whether the current input character is a single- or two-byte character. If it is a single-byte character, the standard KMP algorithm operates on that single-byte character, T[i], in lines 7 to 10. Otherwise, i is pointing at a two-byte character. This implies that: (a) matching of 2-byte characters is carried out, where the data in T[i+1] is the second byte of the character (line 11); and (b) i is incremented by 2 instead of 1, because it counts in terms of bytes (line 12). Since j counts in terms of characters, the increment of j (line 15) is one whether the characters in P are single- or two-byte. When the pattern string is found in T, the position of the first matched character in T is returned. Since the position is in terms of bytes, it is the last matched position, i, minus the length of P in terms of bytes (i.e. ||P||).</Paragraph> <Paragraph position="4"> a conceptual array which can hold both single- and 2-byte characters. This array is implemented as two arrays, P1[] and P2[], which store the first and second bytes of the 2-byte characters, respectively. 
The function, f(), maps two-byte characters into single-byte characters, simplifying the generation of values in the array, next[], and the failure links in fl[].</Paragraph> </Section> <Section position="2" start_page="0" end_page="1112" type="sub_section"> <SectionTitle> 3.2 Generating next[] </SectionTitle> <Paragraph position="0"> The array, next[], contains the failure link values, which can be generated by existing algorithms (Standish, 1980) for single-byte characters. The basic idea is to map the 2-byte characters in P to single-byte characters and then use existing algorithms. The mapping is implemented as an array, f[]. Each character in P is scanned from left to right. Whenever an unseen 2-byte character is found, it is assigned a character value that is the negative of the number of different 2-byte characters seen so far. For example, the third unseen 2-byte character is mapped to a one-byte character, the value of which is (char) -3.</Paragraph> <Paragraph position="1"> The mapping scheme is practical. First, the number of different characters that can be represented with a negative value is 127 and usually |P| &lt; 128 characters.</Paragraph> <Paragraph position="2"> Second, the mapping can be done in O(||P||) time, which is linear with respect to |P| and constant with respect to |T|. This is important because it is added to the total time complexity of searching. To achieve O(||P||), the function, found(), uses an array, f[], of size |Σ| (where Σ is the alphabet) to store the equivalent single-byte characters. A perfect hash function (section 4), h(), converts the 2-byte characters into an index of f[]. After searching, it is necessary to clear f[]. This can be done in O(||P||) by assigning NULL characters to the locations in f[] corresponding to 2-byte characters in P.</Paragraph> <Paragraph position="3"> 4. 
Finite automaton implementation.</Paragraph> <Paragraph position="4"> Since ||T|| is large, reducing its multiplicative factor in the time complexity would be attractive. In Knuth et al. (1977), this was done using a finite automaton which searches in O(||T||) instead of O(2||T||).</Paragraph> <Paragraph position="5"> Standish (1980) provided an accessible algorithm to build the automaton, M. First, failure link values are computed (similar to computing values in next[]) as in Algorithm 7.4 (Standish, 1980) and then the state transitions are added as in Algorithm 7.5 (Standish, 1980). A direct approach is to compute the conceptual automaton, Mc, which regards the 2-byte characters as one-byte, and then convert the automaton for multi-byte processing. Since the space-time complexity of constructing the automaton depends on the size of the alphabet, which is large (i.e. O(|Σ|×|Qc|) where Qc is the set of states of Mc), this approach is not attractive.</Paragraph> <Paragraph position="6"> For instance, if |Qc| = 10 and |Σ| ≈ 10,000, then about 100,000 units of storage (integers) are needed! Further processing is needed to convert the automaton for 2-byte processing!</Paragraph> </Section> <Section position="3" start_page="1112" end_page="1112" type="sub_section"> <SectionTitle> 4.1 Automaton Implementation. </SectionTitle> <Paragraph position="0"> Another approach uses the different characters in P as the reduced alphabet, Σr, which is much smaller than Σ. We use a mapping function as discussed in section 3.2 to build a mapping of 2-byte characters to one-byte. These one-byte characters and the standard one-byte characters (e.g. ASCII) form Σr. The NULL character, λ, represents all the characters in Σ but not in Σr - {λ} (= Σ - (Σr - {λ})). 
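A minimal sketch of this mapping, under stated assumptions, is as follows. It swaps the array f[] and the perfect hash function h() for a Python dictionary, assumes that a byte of 0x80 or more starts a 2-byte character, and uses hypothetical names (map_pattern, map_text_character, NULL) that do not come from the figures of this paper.

```python
NULL = 0  # stands for every 2-byte character that does not occur in P


def map_pattern(pattern: bytes):
    """Return (mapped, f): the one-byte form of P and the mapping it used."""
    f = {}        # 2-byte character -> negative one-byte value
    mapped = []   # the mapped pattern as a list of signed one-byte values
    i = 0
    while len(pattern) > i:
        if pattern[i] >= 0x80 and len(pattern) > i + 1:
            ch = pattern[i:i + 2]
            if ch not in f:
                f[ch] = -(len(f) + 1)   # the k-th unseen character gets -k
            mapped.append(f[ch])
            i += 2
        else:
            mapped.append(pattern[i])   # one-byte characters keep their value
            i += 1
    return mapped, f


def map_text_character(ch: bytes, f: dict) -> int:
    """Map one character of the text into the reduced alphabet."""
    if len(ch) == 2:
        return f.get(ch, NULL)          # unseen 2-byte characters collapse to NULL
    return ch[0]


# Pattern: X, 'a', X, Y, where X = A440 and Y = A441 are 2-byte characters.
mapped, f = map_pattern(bytes([0xA4, 0x40, 0x61, 0xA4, 0x40, 0xA4, 0x41]))
print(mapped)  # [-1, 97, -1, -2]
```

Discarding the dictionary after a search plays the role of clearing f[]; with the array version only the entries touched by P need to be reset.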
Given that the multi-byte string, P, is transformed into a single-byte string, P', existing algorithms can be used to construct the automaton.</Paragraph> <Paragraph position="1"> For each pattern string, P, string searching will execute the following steps: (a) convert 2-byte characters to one-byte in P to form P' (i.e. form Σr) using the mapping as in section 3.2; (b) compute the failure link values of P' using Algorithm 7.4 in (Standish, 1980); (c) compute the success transitions and store them in δ() as in (Standish, 1980); (d) compute the failure transitions from the failure link values using Algorithm 7.5 in (Standish, 1980) and store the transitions in δ(); (e) use the automaton, M, with state transition function δ(), to search for P' in T; (f) output the matched position, if any; (g) clear the mapping function that forms Σr using P.</Paragraph> </Section> <Section position="4" start_page="1112" end_page="1112" type="sub_section"> <SectionTitle> 4.2 Constructing the automaton. </SectionTitle> <Paragraph position="0"> For steps (c) and (d), the operation of Algorithm 7.5 was illustrated with an example of a binary alphabet in (Standish, 1980). Here, we illustrate the use of a larger alphabet, Σr, with λ ∈ Σr. Suppose the pattern string, P, is as shown in Table 2, which also contains the corresponding P' and failure link values, fl[]. The success transitions are added to δ() as δ(j-1, P'[j]) ← j (e.g. δ(0,&lt;) ← 1 and δ(1,a) ← 2). The failure transitions are computed from state 0 to |P'| because fl[j] &lt; j. For state 0, δ(0,α) ← 0 if α ≠ P'[1] and α ∈ Σr (i.e. δ(0,a) ← 0, δ(0,b) ← 0, δ(0,&gt;) ← 0, δ(0,λ) ← 0 but δ(0,&lt;) ← 1).</Paragraph> <Paragraph position="1"> For other states, δ(j,α) ← δ(fl[j],α) if α ≠ P'[j] and α ∈ Σr (e.g. δ(1,a) ← δ(fl[1],a) = δ(0,a) = 0 and δ(1,&lt;) ← δ(fl[1],&lt;) = δ(0,&lt;) = 1). Effectively, the states in δ(fl[j],.) are copied across to the corresponding entries in δ(j,.) 
except for the successful transition from j.</Paragraph> <Paragraph position="2"> Figure 2 illustrates how a row of entries in δ(fl[j],.) is copied across to compute δ(j,.).</Paragraph> <Paragraph position="3"> state transition table, δ(), is used for updating the values of the current row in δ(). The underlined entries are the success transitions.</Paragraph> <Paragraph position="4"> Figure 3 shows the program that computes the state transitions using the failure links. The program computes the transitions for state 0, the last state and the other states separately. The last state is distinguished because it has no success transition, whereas each of the other states has one. The program for generating failure links is not given because: (1) it is similar to computing next[]; (2) a version is available (Algorithm 7.4 in Standish, 1980) which does not need any modification.</Paragraph> <Paragraph position="5"> links are known. Note that the algorithm assumes that Σr = Σ1 ∪ Σ2, where Σ1 and Σ2 are the one-byte (e.g. ASCII) character alphabet and the transformed 1-byte character alphabet representing the different two-byte characters in P, respectively. Furthermore, since |Σ2| &lt; 128 and Σ2 ⊂ Σ, a multiplicative factor of the space-time complexity can be reduced if mapping is also carried out for single-byte as well as 2-byte characters in P. The correctness of the above program can be shown by mapping all the characters not in Σr to λ because they have identical state transition values (i.e. dividing the alphabet into equivalence classes of identical transition values).</Paragraph> </Section> <Section position="5" start_page="1112" end_page="1113" type="sub_section"> <SectionTitle> 4.3 Searching. </SectionTitle> <Paragraph position="0"> Searching is implemented as state transitions of M (Figure 4). Initially, the state of the automaton, M, is set to 0. The next state is determined by the current character read from the text string, T, at position i and the current state. 
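The construction and search described in this section can be sketched end to end as follows. This is a sketch under stated assumptions, not the programs of Figures 3 and 4 and not a transcription of Algorithms 7.4 and 7.5: it uses the standard prefix-function computation for the failure links, a dictionary instead of f[] and h(), and it maps every distinct pattern character (one-byte or 2-byte) to a small positive integer, with 0 standing for λ. All names are hypothetical, and a byte of 0x80 or more is assumed to start a 2-byte character.

```python
LAMBDA = 0  # symbol standing for every character that does not occur in P


def split_characters(data: bytes) -> list:
    """Split bytes into characters; a byte of 0x80 or more starts a 2-byte one."""
    chars, i = [], 0
    while len(data) > i:
        width = 2 if data[i] >= 0x80 and len(data) > i + 1 else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def build_automaton(pattern: bytes):
    """Return (symbol map, transition table delta) for the mapped pattern."""
    chars = split_characters(pattern)
    symbol = {}                                  # character -> reduced symbol
    for c in chars:
        symbol.setdefault(c, len(symbol) + 1)    # 1, 2, 3, ... (0 is LAMBDA)
    p = [symbol[c] for c in chars]               # pattern over reduced alphabet
    m, size = len(p), len(symbol) + 1

    # Failure links: fl[j] is the length of the longest proper border of p[:j].
    fl = [0] * (m + 1)
    k = 0
    for j in range(2, m + 1):
        while k > 0 and p[k] != p[j - 1]:
            k = fl[k]
        if p[k] == p[j - 1]:
            k += 1
        fl[j] = k

    # Row 0 maps everything to state 0; every later row copies row fl[j].
    # Each row then gets its single success transition (the last row has none).
    delta = [[0] * size for _ in range(m + 1)]
    for j in range(m + 1):
        if j > 0:
            delta[j] = list(delta[fl[j]])
        if m > j:
            delta[j][p[j]] = j + 1
    return symbol, delta


def search(text: bytes, pattern: bytes) -> int:
    """Return the byte offset of the first match of pattern in text, or -1."""
    symbol, delta = build_automaton(pattern)
    final = len(delta) - 1
    if final == 0:
        return 0                                 # empty pattern matches at 0
    state, i = 0, 0
    for c in split_characters(text):
        i += len(c)                              # i counts bytes consumed
        state = delta[state][symbol.get(c, LAMBDA)]
        if state == final:
            return i - len(pattern)              # i minus the byte length of P
    return -1


text = bytes([0xA4, 0xAA, 0xAA, 0x43])
print(search(text, bytes([0xAA, 0xAA])))  # -1: AAAA is not a character of T
print(search(text, bytes([0xAA, 0x43])))  # 2: second character of T
```

Because rows are filled in increasing state order and fl[j] is smaller than j, the row being copied is always complete, and each text character costs one table lookup.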
If the current state is equal to |P'|, then P is in T at position i - ||P||.</Paragraph> </Section> </Section> </Paper>