File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/96/c96-2200_concl.xml
Size: 2,812 bytes
Last Modified: 2025-10-06 13:57:39
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2200"> <Title>CHINESE STRING SEARCHING USING THE KMP ALGORITHM</Title> <Section position="6" start_page="1113" end_page="1113" type="concl"> <SectionTitle> 5. Practical considerations. </SectionTitle> <Paragraph position="0"> The KMP algorithm (Knuth et al., 1977) is considered to perform better when the pattern string has recurrence patterns. Otherwise, it performs about the same as the brute-force implementation, whose worst-case time complexity is quadratic. For Chinese string searching, it is not uncommon to search for reduplicating words (e.g.</Paragraph> <Paragraph position="1"> ~3&quot;'S~.3 and SSOSS(31AOIAO) (Chen et al., 1992), which have recurrence patterns. Such repetition is used to form words for emphasis and is an essential part of yes-no questions. Otherwise, recurrence patterns in P occur only incidentally (e.g. nn~j~n~WA~3Aq&quot;t, translated as the Department of Chinese, Chinese University of Hong Kong).</Paragraph> <Paragraph position="2"> Apart from recurrence, the KMP algorithm would also perform better than the brute-force implementation if there are many backing-up operations. Such cases occur where a proper prefix of the pattern string occurs with high frequency in the text string (e.g.</Paragraph> <Paragraph position="3"> function words). In Chinese string searching, this happens for technical terms that have a high-frequency prefix constituent. For instance, Chinese law articles have many terms beginning with the word ~deg~ (i.e.</Paragraph> <Paragraph position="4"> China). A search through the Chinese law text for P:~%~H will require many backing-up operations (each committing a false start) in the brute-force implementation when words or phrases like ~D%&quot;k&lt;ffS, c~%&quot;deg>)fi~g, cmdegOkDAv and c~c~%',D~k are encountered. Sometimes, a pattern that is a word can match text where the matched string is not functioning as a word. 
For example, nj.\[ (which means conference) can be regarded as a word, but in the phrase 2&quot;~j.lP}~@Pi.s&quot;deg~Abe, the first character (underlined) of the matched string (in italics) is part of a name and the second character (in italics) functions as a verb. Thus, Chinese text is often pre-segmented, and string searching has to attach delimiters to the beginning and end of the pattern, P. However, the searching accuracy depends on the segmentation algorithm, which is usually implemented as a dictionary look-up procedure. If a dictionary has poor coverage, the text tends to be over-segmented (Luk, 1994) and the recall of searching will drop drastically. Such cases occur if a general dictionary is used to segment technical articles (e.g. in law, medicine, computing, etc.).</Paragraph> </Section></Paper>
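The two practical points of this section (false starts caused by a high-frequency prefix, and delimiter-wrapped search over pre-segmented text) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the "/" delimiter, and the sample strings (中国人民法院, 中国人民银行, 他会议论这件事) are assumptions chosen for illustration and do not come from the paper's own examples.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the standard KMP failure function)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return (match start offsets, number of character comparisons)."""
    if not pattern:
        return [], 0
    fail = failure_table(pattern)
    hits, comparisons, k = [], 0, 0
    for i, ch in enumerate(text):
        while True:
            comparisons += 1
            if ch == pattern[k]:
                k += 1
                break
            if k == 0:
                break
            k = fail[k - 1]          # shift the pattern; the text index never backs up
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits, comparisons


def brute_force_search(text, pattern):
    """Return (match start offsets, number of character comparisons)."""
    if not pattern:
        return [], 0
    hits, comparisons = [], 0
    for start in range(len(text) - len(pattern) + 1):
        j = 0
        while j < len(pattern):
            comparisons += 1
            if text[start + j] != pattern[j]:
                break                # false start: back up to start + 1
            j += 1
        if j == len(pattern):
            hits.append(start)
    return hits, comparisons


def find_word(segmented_text, word, delim="/"):
    """Search pre-segmented text for `word` as a whole segment by wrapping
    both text and pattern in delimiters (hypothetical helper).
    Returned offsets index into `segmented_text`."""
    hits, _ = kmp_search(delim + segmented_text + delim, delim + word + delim)
    return hits


# Hypothetical text in which the prefix 中国人民 is frequent.
text = "中国人民法院" * 3 + "中国人民银行"
pattern = "中国人民银行"
print(kmp_search(text, pattern))
print(brute_force_search(text, pattern))

# 会议 (conference) matches the raw text, but not as a segmented word.
print(kmp_search("他会议论这件事", "会议")[0])
print(find_word("他/会/议论/这/件/事", "会议"))
```

On inputs like these, the brute-force search re-examines text characters after every false start, whereas the KMP search only shifts the pattern. The delimiter-wrapped search is only as good as the segmentation it is given, which is the recall issue noted above.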