<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1104"> <Title>Adaptive Compression-based Approach for Chinese Pinyin Input</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Statistical Language Modelling </SectionTitle> <Paragraph position="0"> Statistical language modelling has been successfully applied to Chinese Pinyin input (Gao et al., 2002). The task of statistical language modelling is to determine the probability of a sequence of words.</Paragraph> <Paragraph position="2"> Given the previous i-1 words, it is difficult to compute the conditional probability if i is very large. An n-gram Markov model approximates this probability by assuming that only words relevant to predict are previous n-1 words. The most commonly used is trigram.</Paragraph> <Paragraph position="4"> The key difficulty with using n-gram language models is that of data sparsity. One can never have enough training data to cover all the ngrams. Therefore some mechanism for assigning non-zero probability to novel n-grams is a key issue in statistical language modelling. Smoothing is used to adjust the probabilities and make distributions more uniform. Chen and Goodman (Chen and Goodman, 1999) made a complete comparison of most smoothing techniques and found that the modified Kneser-Ney smoothing(equation 3) outperformed others.</Paragraph> <Paragraph position="6"> The process of Pinyin input can be formulated as follows.</Paragraph> <Paragraph position="8"> We assume each Chinese character has only one pronunciation in our experiments.</Paragraph> <Paragraph position="9"> Thus we can use the Viterbi algorithm to find the word sequences to maximize the language model according to Pinyin input.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Prediction by Partial Matching </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Prediction by Partial Matching (PPM)(Cleary </SectionTitle> <Paragraph position="0"> and Witten, 1984; Bell et al., 1990) is a symbolwise compression scheme for adaptive text compression. PPM generates a prediction for each input character based on its preceding characters. The prediction is encoded in form of conditional probability, conditioned on previous context. PPM maintains predictions, computed from the training data, for larger context as well as all shorter con-texts. If PPM cannot predict the character from current context, it uses an escape probability to &quot;escape&quot; another context model, usually of length one shorter than the current context. For novel characters that have never seen before in any length model, the algorithm escapes down to a default &quot;order-1&quot; context model where all possible characters are present.</Paragraph> <Paragraph position="1"> PPM escape method can be considered as an instance of Jelinek-Mercer smoothing. It is defined recursively as a linear interpolation between the nth-order maximum likelihood and the (n-1)th-order smoothed model. Various methods have been proposed for estimating the escape probability. In the following description of each method, e is the escape probability and p(ph) is the conditional probability for symbol ph , given a context. c(ph) is the number of times the context was followed by the symbol ph . n is the number of tokens that have followed. 
<Paragraph position="10"> To illustrate the PPM compression modelling technique, Table 1 shows the model after the string dealornodeal has been processed. In this illustration the maximum order is 2 and each prediction has a count c and a prediction probability p. The probability is determined from the dealornodeal counts associated with the prediction using escape method D (equation 16). |A| is the size of the alphabet, which determines the probability for each unseen character.</Paragraph>
<Paragraph position="11"> Suppose the character following dealornodeal is o. Since the order-2 context is al and the upcoming symbol o has already been seen in this context, the order-2 model is used to encode the symbol; the encoding probability is 1/2. If the next character were i instead of o, it would not have been seen in the current order-2 context (al), so an order-2 escape event would be emitted with a probability of 1/2 and the context truncated to l. Checking the order-1 model, the upcoming character i has not been seen in this context either, so an order-1 escape event is emitted with a probability of 1/2 and the context is truncated to the null context, corresponding to the order-0 model. As i has not appeared in the string dealornodeal, a final level of escape is emitted with a probability of 7/24 and i is predicted by the order −1 model with a probability of 1/256, assuming an alphabet size of 256 for ASCII. Thus i is encoded with a total probability of 1/2 × 1/2 × 7/24 × 1/256.</Paragraph>
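As a concrete check of the example above, the short Python sketch below builds the order-0 to order-2 counts for the string dealornodeal and reproduces the escape-method-D probabilities quoted in the text (1/2 for o after al, and 1/2 × 1/2 × 7/24 × 1/256 for a novel i). It is an illustrative toy rather than the system described in the paper: the function names, the 256-symbol ASCII alphabet and the absence of exclusions (introduced in the next paragraph) are assumptions made for this example.

from collections import defaultdict

ALPHABET_SIZE = 256   # ASCII alphabet assumed in the worked example
MAX_ORDER = 2         # maximum context length, as in Table 1

def build_counts(text, max_order=MAX_ORDER):
    """For every context of length 0..max_order, count the symbols that follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text)):
        for order in range(max_order + 1):
            if i - order >= 0:
                counts[text[i - order:i]][text[i]] += 1
    return counts

def ppmd_probability(counts, context, symbol):
    """Multiply escape-method-D escape probabilities while backing off through
    shorter contexts until the symbol is predicted; a novel symbol falls
    through to a uniform order -1 model (no exclusions applied)."""
    total = 1.0
    for order in range(len(context), -1, -1):
        followers = counts.get(context[len(context) - order:], {})
        n = sum(followers.values())    # tokens that have followed this context
        t = len(followers)             # distinct types that have followed it
        if n == 0:
            continue                   # context never seen: nothing to encode here
        if symbol in followers:
            return total * (2 * followers[symbol] - 1) / (2 * n)   # p(phi), method D
        total *= t / (2 * n)           # escape probability e = t / 2n, method D
    return total / ALPHABET_SIZE       # order -1 model: uniform over the alphabet

counts = build_counts("dealornodeal")
print(ppmd_probability(counts, "al", "o"))   # 0.5, i.e. 1/2 from the order-2 model
print(ppmd_probability(counts, "al", "i"))   # 7/24576, i.e. 1/2 * 1/2 * 7/24 * 1/256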
<Paragraph position="12"> In reality, the alphabet size in the order −1 model may be reduced by the number of characters in the order-0 model, as these characters will never be predicted in the order −1 context; it can thus be reduced to 249 in this case. Similarly, a character that occurs in a higher-order model will never be encoded in the lower-order models, so it is not necessary to reserve probability space for that character in the lower-order models. This is called &quot;exclusion&quot;, which can greatly improve compression.</Paragraph>
<Paragraph position="13"> Compression results were compared on People Daily (9101) text of 792,964 bytes using different compression methods. Except for escape method A, the PPM compression methods are significantly better than practical compression utilities such as Unix gzip and compress, although they are slower during compression. The compression rates for escape methods B and D are both higher than for escape method C. The order-2 model (trigram) is slightly better than the order-1 and order-3 models for escape method D.</Paragraph>
<Paragraph position="14"> In our experiments we use escape method D to calculate the escape probability, as escape method D is slightly better than the other escape methods at compressing text in general, although method B is the best here. Teahan (Teahan et al., 2000) has successfully applied escape method D to segmenting Chinese text.</Paragraph>
</Section>
</Section>
</Paper>