<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2808">
<Title>Anomaly Detecting within Dynamic Chinese Chat Text</Title>
<Section position="4" start_page="48" end_page="50" type="metho">
<SectionTitle> 3 Anomaly Detection with Standard Chinese Corpora </SectionTitle>
<Section position="1" start_page="48" end_page="49" type="sub_section">
<Paragraph position="0"> Chat text exhibits anomalous characteristics in using or forming words. We argue that anomalous chat text, referred to as anomaly in this article, can be identified with language models constructed on standard Chinese corpora using statistical language modeling (SLM) techniques, e.g. the trigram model. </Paragraph>
<Paragraph position="1"> The problem of anomaly detection can be stated as follows: given a piece of anomalous chat text $W=\{w_1, w_2, \ldots, w_N\}$ and a language model $LM=\{p(x)\}$, we attempt to recognize $W$ as an anomaly with the language model. We propose two approaches to this problem. The first is a confidence-based approach that calculates how well $W$ fits the language model. </Paragraph>
<Paragraph position="2"> The second approach is based on entropy calculation. The entropy method was originally proposed to estimate how good a language model is. In our work we apply it to estimate how well the constructed language models reflect their corpora, under the assumption that the corpora are sound and complete. </Paragraph>
<Paragraph position="3"> Although numerous statistical methods exist for constructing a natural language model, their objective is the same: to construct a probability distribution model $p(x)$ that fits the observed language data in the corpus as closely as possible. We implement the trigram model and create language models from three Chinese corpora, i.e. the People's Daily corpus, the Chinese Gigaword and the Chinese Penn Treebank, and we investigate the quality of the language models produced from these corpora. </Paragraph>
</Section>
<Section position="2" start_page="49" end_page="49" type="sub_section">
<SectionTitle> 3.1 The N-gram Language Models </SectionTitle>
<Paragraph position="0"> The n-gram model is the most widely used model in statistical language modeling today. Without loss of generality, we express the probability of a word sequence $W=\{w_1, w_2, \ldots, w_N\}$ as $p(W)=\prod_{i=1}^{N} p(w_i \mid h_i)$, where the history $h_i=\{w_1, \ldots, w_{i-1}\}$ is chosen appropriately to handle the initial condition. The probability of the next word $w_i$ is thus conditioned on the history $h_i$ of words that have been given so far. With this factorization the complexity of the model grows exponentially with the length of the history. One of the most successful models of the past two decades is the trigram model (n=3), in which only the most recent two words of the history are used to condition the probability of the next word. </Paragraph>
<Paragraph position="1"> Instead of the actual words, one can use a set of word classes. Classes based on POS tags, on the morphological analysis of words, or on semantic information have been tried, as have automatically derived classes based on statistical models of co-occurrence (Brown et al., 1990). If the classes are non-overlapping, the class model can generally be described as $p(w_i \mid h_i)=p(w_i \mid c_i)\,p(c_i \mid c_{i-2}, c_{i-1})$. These tri-class models have had higher perplexities than the corresponding trigram model; however, they have led to a reduction in perplexity when linearly combined with the trigram model. </Paragraph>
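<Paragraph position="2"> As a concrete illustration of the trigram model used in this section, the Python sketch below collects trigram counts from a pre-segmented corpus and assigns every trigram a non-zero probability through linear interpolation with bigram and unigram estimates. This is a minimal sketch rather than the implementation used in the paper; the interpolation weights, the vocabulary-size constant and all identifiers are illustrative assumptions.

from collections import Counter

def train_ngram_counts(sentences):
    """Collect unigram, bigram and trigram counts from pre-segmented sentences
    (each sentence is a list of words)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        padded = ["BOS", "BOS"] + list(words) + ["EOS"]
        for i in range(2, len(padded)):
            w1, w2, w3 = padded[i - 2], padded[i - 1], padded[i]
            uni[w3] += 1
            bi[(w2, w3)] += 1
            tri[(w1, w2, w3)] += 1
    return uni, bi, tri

def interpolated_trigram_prob(trigram, uni, bi, tri,
                              lambdas=(0.7, 0.2, 0.1), vocab_size=50000):
    """p(w3 | w1, w2) as a linear interpolation of trigram, bigram and unigram
    maximum-likelihood estimates, so that unseen trigrams still receive a
    non-zero probability."""
    w1, w2, w3 = trigram
    l3, l2, l1 = lambdas
    total = sum(uni.values())
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_uni = (uni[w3] + 1) / (total + vocab_size)  # add-one floor keeps the result positive
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

The same routine can be trained on word trigrams or, by replacing the words with their tags, on POS tag trigrams. </Paragraph>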
</Section>
<Section position="3" start_page="49" end_page="49" type="sub_section">
<SectionTitle> 3.2 The Confidence-based Approach </SectionTitle>
<Paragraph position="0"> Given a piece of chat text $W=\{w_1, w_2, \ldots, w_N\}$, the word sequence is obtained with a standard Chinese word segmentation tool, e.g. ICTCLAS. Because ICTCLAS is a segmentation tool based on a standard vocabulary, unknown chat terms (e.g., &quot;Jie Ge&quot;) are broken into several elementary Chinese words (i.e., &quot;Jie&quot; and &quot;Ge&quot; in this case). This does not hurt the algorithm, because we use trigrams in this method: a chat term may produce anomalous word trigrams, which are evidence for anomaly detection. </Paragraph>
<Paragraph position="1"> We use a non-zero probability for each trigram in this calculation. The calculation is simple, even naive: it produces a so-called confidence, which reflects how well the given text fits the training corpus in arranging its elementary Chinese words. It is motivated by the observation that chat terms use elementary words in anomalous manners that cannot be simulated by the training corpus. </Paragraph>
<Paragraph position="2"> The confidence-based value is defined as $C(W) = \frac{1}{K}\sum_{k=1}^{K} p(t_k)$, where $K$ denotes the number of trigrams in chat text $W$ and $p(t_k)$ is the assigned probability of the trigram $t_k$. If $t_k$ is unseen in the training corpus, linear interpolation is applied to estimate its probability. </Paragraph>
<Paragraph position="3"> We empirically set a confidence threshold value to determine whether the input text contains chat terms, i.e. whether it is a piece of chat text. The input is concluded to be standard text if its confidence is bigger than the confidence threshold value; otherwise, the input is concluded to be chat text. The confidence threshold value can be estimated with a training chat text collection. </Paragraph>
</Section>
<Section position="4" start_page="49" end_page="50" type="sub_section">
<SectionTitle> 3.3 The Entropy-based Approach </SectionTitle>
<Paragraph position="0"> The idea behind this approach comes from entropy-based language modeling. Given a language model, one can use the quantity of entropy to estimate how good the language model (LM) is. Denote by $p$ the true distribution, which is unknown to us, of a segment of new text $x$ of $k$ words. The entropy on a per-word basis is then defined as $H = -\frac{1}{k}\sum_{x} p(x)\log p(x)$; if the words are uniformly distributed over a vocabulary $V$ then $H = \ln|V|$, and $H$ is smaller than $\ln|V|$ for other distributions of the words. </Paragraph>
<Paragraph position="1"> Following this estimation method, we compute the entropy-based value on a per-trigram basis for the input chat text. Given a standard LM, denoted by $\tilde p$ and modeled by trigrams, the entropy value is calculated as $H_{LM}(W) = -\frac{1}{K}\sum_{k=1}^{K} \log \tilde p(t_k)$ (5), where $K$ denotes the number of trigrams the input text contains. Our goal is to find how different the input text is from the LM; obviously, a bigger entropy discloses a more anomalous piece of chat text. An empirical entropy threshold is again estimated on a training chat text collection. The input is concluded to be standard text if its entropy is smaller than the entropy threshold value; otherwise, the input is concluded to be chat text. </Paragraph>
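<Paragraph position="2"> The sketch below scores one segmented input with both approaches, reusing the interpolated trigram probability from the sketch in Section 3.1. The threshold defaults are placeholders only; in the paper they are estimated on a labelled collection as described in Section 5.2, and the function names are our own.

import math

def detect_anomaly(trigrams, trigram_prob, conf_threshold=0.01, entropy_threshold=8.0):
    """Score a segmented input text with the confidence-based (Section 3.2)
    and entropy-based (Section 3.3) approaches.  trigram_prob(t) must return
    a non-zero probability for any trigram t."""
    probs = [trigram_prob(t) for t in trigrams]
    K = len(probs)
    confidence = sum(probs) / K                      # mean assigned trigram probability
    entropy = -sum(math.log(p) for p in probs) / K   # per-trigram entropy, equation (5)
    chat_by_confidence = conf_threshold >= confidence   # standard text only when the confidence exceeds the threshold
    chat_by_entropy = entropy >= entropy_threshold      # standard text only when the entropy stays below the threshold
    return confidence, entropy, chat_by_confidence, chat_by_entropy
</Paragraph>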
</Section>
</Section>
<Section position="5" start_page="50" end_page="50" type="metho">
<SectionTitle> 4 Incorporating the Chat Text Corpus </SectionTitle>
<Paragraph position="0"> We argue that the performance of both approaches can be improved when an initial static chat text corpus is incorporated. The chat text corpus provides some basic forms of anomalous chat text, and the forms we observe provide valuable heuristics for the trigram models. Within the chat text corpus, we consider only the word trigrams and POS tag trigrams in which anomalous chat text appears, and we thus construct two trigram lists. A probability is produced for each trigram according to its occurrence. </Paragraph>
<Paragraph position="1"> For each trigram of the input text, if it appears in the chat text corpus, we adjust the confidence and entropy values by incorporating its probability in the chat text corpus. </Paragraph>
<Section position="1" start_page="50" end_page="50" type="sub_section">
<SectionTitle> 4.1 The Refined Confidence </SectionTitle>
<Paragraph position="0"> The intention of the refined confidence calculation is to decrease the confidence of input chat text when chat text trigrams are found. Normally, when a trigram is found in the chat text corpus, its weight will be much lower than 1. By multiplying the confidence by such weights, the confidence of input chat text is decreased so that the text can be detected more easily. </Paragraph>
</Section>
<Section position="2" start_page="50" end_page="50" type="sub_section">
<SectionTitle> 4.2 The Refined Entropy </SectionTitle>
<Paragraph position="0"> Instead of assigning a weight, we introduce the entropy-based value of the input chat text computed on the chat text corpus to produce a new equation. We denote by $H_{LM}(W)$ the entropy calculated with equation (5) on the standard language model and by $H_{C}(W)$ the corresponding value calculated on the chat text trigram lists; the rewritten entropy-based value, equation (9), combines the two. The intention of incorporating $H_{C}(W)$ in the entropy calculation is to increase the entropy of input chat text when chat text trigrams are found, and it can easily be proved that the refined value is never smaller than $H_{LM}(W)$. As bigger entropy discloses a more anomalous piece of chat text, we believe more anomalous chat texts can be correctly detected with equation (9). </Paragraph>
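<Paragraph position="1"> The sketch below illustrates one way the chat text trigram lists could be folded into the two scores: per-trigram chat probabilities act as multiplicative weights on the confidence and contribute an additional non-negative term to the entropy. This combination follows the stated intent of Sections 4.1 and 4.2 but is an assumption made for illustration, not the paper's exact equations.

import math

def refined_scores(trigrams, chat_tri_prob, base_confidence, base_entropy):
    """Adjust the standard-corpus confidence and entropy with evidence from a
    chat text trigram list.  chat_tri_prob maps trigrams observed in the chat
    text corpus to their probabilities there; trigrams absent from the list
    leave both scores unchanged."""
    confidence = base_confidence
    chat_entropy, hits = 0.0, 0
    for tri in trigrams:
        p_chat = chat_tri_prob.get(tri)
        if p_chat is not None:
            confidence *= p_chat               # p_chat is much lower than 1, so the confidence drops
            chat_entropy += -math.log(p_chat)
            hits += 1
    if hits:
        chat_entropy /= hits
    entropy = base_entropy + chat_entropy      # never smaller than the original entropy
    return confidence, entropy
</Paragraph>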
</Section>
</Section>
<Section position="6" start_page="50" end_page="51" type="metho">
<SectionTitle> 5 Evaluations </SectionTitle>
<Paragraph position="0"> Three experiments are conducted in this work. The first experiment aims to estimate threshold values from a real text collection. The remaining experiments seek to evaluate the performance of the approaches under various configurations. </Paragraph>
<Section position="1" start_page="50" end_page="51" type="sub_section">
<SectionTitle> 5.1 Data Description </SectionTitle>
<Paragraph position="0"> We use two types of text corpora to train our approaches in the experiments. The first type is standard Chinese corpora, which are used to construct the standard language models. We use the People's Daily corpus, also known as the Peking University corpus (PKU), the Chinese Gigaword (CNGIGA) and the Chinese Penn Treebank (CNTB) in this work. In terms of coverage, CNGIGA is the best of the three, whereas PKU and CNTB provide more syntactic information in their annotations. The other type of training corpus is a chat text corpus. We use the NIL corpus described in (Xia et al., 2005b), in which each anomalous chat text is annotated with its attributes. </Paragraph>
<Paragraph position="1"> We create four test sets for our experiments. Test set #1 is used to estimate the threshold values of confidence and entropy for our approaches; the values are estimated on the two types of trigrams in the three corpora. Test set #1 contains 89 pieces of typical Chinese chat text selected from the NIL corpus and 49 pieces of standard Chinese sentences selected by hand from online Chinese news. The different numbers of chat texts and standard sentences in this test set carry no special significance. </Paragraph>
<Paragraph position="2"> The remaining three test sets are used to compare the performance of our approaches on test data created in different time periods. Test set #2 is the earliest and test set #4 the latest according to their time stamps. There are 10K sentences in total in test sets #2, #3 and #4. In this collection, the chat texts are selected from the YESKY BBS system (http://bbs.yesky.com/bbs/) and cover BBS text from March and April 2005 (later than the chat text in the NIL corpus), and the standard texts are extracted randomly from online Chinese news. We describe the four test sets in Table 1. </Paragraph>
</Section>
<Section position="2" start_page="51" end_page="51" type="sub_section">
<SectionTitle> 5.2 Experiment I: Threshold Values Estimation </SectionTitle>
<Paragraph position="0"> This experiment seeks to estimate the threshold values of confidence and entropy for the two types of trigrams in the three Chinese corpora. We first run the two approaches using only the standard Chinese corpora on the 138 sentences in the first test set and put the calculated values (confidence or entropy) into two arrays. Since we already know the type of each sentence in the first test set, we are able to select in each array the value that produces the lowest error rate. In this way we obtain the first group of threshold values for our approaches. </Paragraph>
<Paragraph position="1"> We then incorporate the NIL corpus into the two approaches and run them again, producing the second group of threshold values in the same way as the first. </Paragraph>
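<Paragraph position="2"> The selection step can be sketched as follows: given the scores computed on the labelled development sentences, pick the cut-off with the lowest error rate. The decision directions follow Sections 3.2 and 3.3 (chat text has low confidence but high entropy); the function itself and its name are illustrative rather than the paper's code.

def estimate_threshold(scores, labels, higher_means_chat):
    """Choose the threshold that minimises the error rate on a labelled set.
    scores            : confidence or entropy values, one per sentence
    labels            : True if the sentence is chat text, False otherwise
    higher_means_chat : True for entropy scores, False for confidence scores"""
    best_threshold, best_errors = None, len(scores) + 1
    for candidate in sorted(set(scores)):
        errors = 0
        for score, is_chat in zip(scores, labels):
            if higher_means_chat:
                predicted_chat = score >= candidate
            else:
                predicted_chat = candidate >= score
            if predicted_chat != is_chat:
                errors += 1
        if best_errors > errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold, best_errors / len(scores)
</Paragraph>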
</Section>
</Section>
<Section position="7" start_page="51" end_page="52" type="metho">
<SectionTitle> 5.2.2 Results </SectionTitle>
<Paragraph position="0"> The selected threshold values and the corresponding error rates are presented in Tables 2~5, covering both the approach that uses only the standard Chinese corpora and the approach that incorporates the NIL corpus. We use the selected threshold values in experiments II and III to detect anomalous chat text within test sets #2, #3 and #4. </Paragraph>
<Section position="1" start_page="52" end_page="52" type="sub_section">
<SectionTitle> 5.3 Experiment II: Anomaly Detection with Three Standard Chinese Corpora </SectionTitle>
<Paragraph position="0"> In this experiment, we run the two approaches using the standard Chinese corpora on test set #2. The threshold values estimated in experiment I are applied to help make the decisions. </Paragraph>
<Paragraph position="1"> Input text can be detected as either standard text or chat text, but we are interested only in how correctly the anomalous chat text is detected. We therefore calculate precision ($p$), recall ($r$) and F-1 measure ($f$) as $p = \frac{a}{a+c}$, $r = \frac{a}{a+b}$ and $f = \frac{2pr}{p+r}$, where $a$ is the number of true positives, $b$ the number of false negatives and $c$ the number of false positives. </Paragraph>
</Section>
</Section>
</Paper>