File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/w98-1210_metho.xml
Size: 8,435 bytes
Last Modified: 2025-10-06 14:15:15
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1210"> <Title>Finding Structure via Compression</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Entropic Chunking </SectionTitle> <Paragraph position="0"> A predictive model M is one which, when presented with a sequence of symbols s, makes a prediction about the next symbol in the sequence in the form of a probability distribution over the alphabet Σ (for the purposes of this investigation, Σ is the set of ASCII characters). We assume that the estimated probability distribution is smoothed to avoid the zero-frequency problem. The specifics of the model are unimportant; the methods presented in this paper are intended to be generic, but it is clear that n-th order Markov models, for n less than the length of s, would qualify.</Paragraph> <Paragraph position="1"> The information of a symbol w with respect to a statistical model M and a context s is defined in Equation 1. Intuitively we may think of the information as the surprise the model experiences upon receipt of the symbol w; it is low if the model's expectations are vindicated, high if they are erroneous (Shannon and Weaver, 1949).</Paragraph> <Paragraph position="2"> I(w \mid s, M) = -\log_2 P(w \mid s, M) \qquad (1) </Paragraph> <Paragraph position="3"> The entropy of a language model, defined in Equation 2, is the expected value of the information.</Paragraph> <Paragraph position="4"> The entropy is a measure of the model's uncertainty about the future; it will be low if the model expects one particular symbol to occur with high probability, and it increases as the estimated probability distribution approaches the uniform distribution.</Paragraph> <Paragraph position="5"> H(M \mid s) = \sum_{w \in \Sigma} P(w \mid s, M) \, I(w \mid s, M) \qquad (2) </Paragraph> <Paragraph position="6"> If one monitors the instantaneous entropy of a language model as it scans across an English text, one generally finds that regions of high entropy correspond with word boundaries (Alder, 1988). This is convincingly demonstrated by Figure 1, which plots the entropy of a second-order Markov model across the first sentence of &quot;A Scandal in Bohemia&quot;, by Sir Arthur Conan Doyle. The training corpus used in this example was 3.5 megabytes of Sherlock Holmes stories, minus the testing sentence. Segmentation is a matter of chunking the data whenever the instantaneous entropy exceeds some threshold value (Wolff, 1977). A chunk is merely a string of symbols which together constitute a higher-level lexeme. Throughout this paper a chunking threshold of 1/2 log_2 ||Σ|| bits is used, although this is almost certainly not an optimal value. The problem of finding a good threshold automatically warrants investigation. The Sherlock Holmes corpus was segmented in this way. Table 1 lists, in decreasing order of frequency, the most common chunks found in the text.</Paragraph> <Paragraph position="7"> The fact that they agree rather well with the most frequent words in the English language is encouraging. [Table 1: the most common chunks found in the Sherlock Holmes corpus.]</Paragraph> <Paragraph position="8"> A total of 70171 distinct chunks were found in the corpus. Of these, a massive 66821 chunks occurred ten times or fewer--these chunks were discarded due to their infrequency (all anomalous chunks, such as &quot;halfc&quot; and &quot;ichth&quot;, occurred in this group). The majority of the remaining 3350 chunks were found to be valid English words. Those that weren't were strings of two or more English words, such as &quot;in␣the␣&quot;, &quot;it␣was␣&quot; and &quot;do␣you␣think␣that␣&quot;, where ␣ marks a space character that forms part of the chunk. The previous experiment was repeated using a version of the Sherlock Holmes corpus which had many clues to word boundaries removed; all characters were replaced with their uppercase equivalents, and whitespace and punctuation symbols were deleted. Many good chunks were discovered, such as &quot;THE&quot;, &quot;TO&quot;, &quot;WAS&quot;, &quot;OFTHE&quot;, &quot;HAVEBEEN&quot; and &quot;POLICE&quot;. However, anomalous chunks were prevalent, with &quot;REWAS&quot; and &quot;STO&quot; occurring as frequently as the chunks a human being would identify as English words.</Paragraph> <Paragraph position="9"> Even so, entropic chunking provides a technique for discovering structure which makes very few assumptions about the information that the data contains.</Paragraph>
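As a concrete illustration of the procedure just described, the sketch below trains a smoothed second-order character Markov model and splits a text wherever the instantaneous entropy exceeds the 1/2 log_2 ||Σ|| threshold. It is a minimal Python rendering for exposition only, not the implementation used in the experiments; the ASCII alphabet size, the add-one smoothing scheme and the corpus filename are assumptions made for the example.

```python
import math
from collections import defaultdict, Counter

ALPHABET_SIZE = 128                          # assume the ASCII alphabet, as in the paper
THRESHOLD = 0.5 * math.log2(ALPHABET_SIZE)   # the 1/2 log_2 ||Sigma|| chunking threshold

class MarkovModel:
    """A second-order character model with add-one smoothing (one of the many
    predictive models that would qualify)."""

    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(Counter)   # context -> counts of the following symbol

    def train(self, text):
        for i in range(self.order, len(text)):
            self.counts[text[i - self.order:i]][text[i]] += 1

    def prob(self, symbol, context):
        c = self.counts[context]
        return (c[symbol] + 1) / (sum(c.values()) + ALPHABET_SIZE)

    def entropy(self, context):
        """Expected information of the next symbol (Equation 2)."""
        return -sum(self.prob(chr(w), context) * math.log2(self.prob(chr(w), context))
                    for w in range(ALPHABET_SIZE))

def chunk(model, text, threshold=THRESHOLD):
    """Split the text wherever the instantaneous entropy exceeds the threshold."""
    chunks, start = [], 0
    for i in range(model.order, len(text)):
        if model.entropy(text[i - model.order:i]) > threshold:
            if i > start:
                chunks.append(text[start:i])
            start = i
    chunks.append(text[start:])
    return chunks

# Hypothetical usage: train on a corpus file and segment a held-out sentence.
# model = MarkovModel(order=2)
# model.train(open("holmes.txt").read())
# print(chunk(model, "To Sherlock Holmes she is always the woman."))
```

Recomputing the full distribution at every position is wasteful; caching the entropy per context would be the obvious refinement, but it does not change the behaviour being illustrated.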
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Finding Separator Symbols </SectionTitle> <Paragraph position="0"> In natural language text, words are typically separated by whitespace. Entropic chunking may be used to discover this automatically, by recording which symbols occur immediately prior to a large jump in entropy.</Paragraph> <Paragraph position="1"> Table 2 lists, in decreasing order of frequency, the separator symbols discovered in the Sherlock Holmes corpus.</Paragraph> <Paragraph position="2"> [Table 2: separator symbols discovered in the Sherlock Holmes corpus.] </Paragraph> <Paragraph position="3"> The whitespace symbol was found to precede the great majority of sudden jumps in entropy, which agrees with our expectations. The - symbol occurs within hyphenated words, which were usually broken up into their constituents, while the &quot; symbol occurs as a chunk separator whenever two pieces of dialogue appear back-to-back. The remaining probability mass was distributed over 43 symbols, which were discarded as anomalies.</Paragraph> <Paragraph position="5"> Once one or more separator symbols have been found, traditional parsing techniques may be used to segment the text.</Paragraph> <Paragraph position="6"> Many data sequences simply will not have separator symbols. For example, a database may store fields in a file based only on their bit length. In such situations entropic chunking must be used if no prior assumptions about the structure of the data are to be made.</Paragraph>
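The separator-finding step admits an equally short sketch. The fragment below is illustrative only: it assumes the MarkovModel interface of the previous sketch, and the one-bit jump threshold is an arbitrary assumption rather than a value taken from the experiments.

```python
from collections import Counter

def find_separators(model, text, jump_threshold=1.0):
    """Tally the symbols that occur immediately before a large jump in the
    model's instantaneous entropy; the most frequent are separator candidates."""
    separators = Counter()
    previous = None
    for i in range(model.order, len(text)):
        entropy = model.entropy(text[i - model.order:i])
        if previous is not None and entropy - previous > jump_threshold:
            separators[text[i - 1]] += 1   # the symbol just prior to the jump
        previous = entropy
    return separators

# Hypothetical usage, with the model trained as in the previous sketch:
# print(find_separators(model, open("holmes.txt").read()).most_common(5))
```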
</Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Data Compression </SectionTitle> <Paragraph position="0"> In order to test the value of adding chunks to a language model's alphabet, we conducted a simple experiment. The Sherlock Holmes corpus was divided into three non-overlapping parts, each of roughly a megabyte in size. These three corpora were used for training, chunking and testing respectively.</Paragraph> <Paragraph position="1"> A standard PPMC model was inferred from the training corpus and used to segment the chunking corpus (Moffat, 1990). The most common chunk was then added to the alphabet of the PPMC model in a process we refer to as the upwrite (Hutchens, 1997).</Paragraph> <Paragraph position="2"> Evaluation was performed by measuring the perplexity of the PPMC model with respect to the testing corpus (Jelinek and Lafferty, 1991). The perplexity, defined in Equation 3 for a corpus of N symbols, is a monotone function of the average information of the model, and is therefore a measure of compression.</Paragraph> <Paragraph position="3"> PP = 2^{\frac{1}{N} \sum_{i=1}^{N} I(w_i \mid s_i, M)} \qquad (3) </Paragraph> <Paragraph position="4"> It should be mentioned that PPMC is usually used in adaptive data compression systems. In our experiment we used it in a non-adaptive way; the model was inferred from one corpus and tested on another.</Paragraph> <Paragraph position="5"> Although true compression systems avoid this two-pass approach due to the expense of transmitting the model, evaluation is performed this way in the speech recognition literature.</Paragraph> <Paragraph position="6"> Iterating this process produced the plot shown in Figure 2. Perplexity is given in units of characters rather than symbols--this is necessary because the alphabet size increases with every chunk added.</Paragraph> <Paragraph position="7"> A minimum perplexity of 4.48 characters was attained after 154 chunks had been added to the model's alphabet. This represents a 9.5% reduction of the model's initial perplexity of 4.95 characters, equivalent to a 6.2% improvement in compression performance. Although this result is by no means ground-breaking, we believe that it illustrates the advantage of chunking.</Paragraph> <Paragraph position="8"> The initial reduction in perplexity is rapid, as the first chunks discovered correspond to the most frequent English words. The continued addition of chunks reduces the perplexity further, discounting minor local variations. We expect that the performance of the model will degrade once too many chunks have been added to its alphabet, but the experiment did not proceed long enough to make this apparent.</Paragraph> </Section> </Paper>
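To make the perplexity bookkeeping of Section 3 concrete, the sketch below evaluates a toy add-one-smoothed unigram model before and after a single frequent chunk is treated as one symbol. The unigram model is a deliberate simplification standing in for the PPMC model used in the experiments, and the sample text and the chunk "the " are invented for the example; the point is only that perplexity is normalised by the number of characters rather than the number of symbols, as the text notes is necessary once chunks enter the alphabet.

```python
import math
from collections import Counter

class UnigramModel:
    """A toy stand-in for the PPMC model: an add-one-smoothed unigram
    distribution over an alphabet that may mix characters and chunks."""

    def __init__(self, symbols, alphabet):
        self.alphabet = alphabet
        self.counts = Counter(symbols)
        self.total = sum(self.counts.values())

    def prob(self, symbol):
        return (self.counts[symbol] + 1) / (self.total + len(self.alphabet))

def character_perplexity(model, symbols, num_characters):
    """Two to the power of the average information (cf. Equation 3), normalised
    here by the number of characters rather than symbols so that results remain
    comparable as chunks are added to the alphabet."""
    information = sum(-math.log2(model.prob(w)) for w in symbols)
    return 2.0 ** (information / num_characters)

def tokenize_with_chunk(text, chunk):
    """Greedily treat occurrences of `chunk` as single symbols (the upwrite)."""
    symbols, i = [], 0
    while i < len(text):
        if text.startswith(chunk, i):
            symbols.append(chunk)
            i += len(chunk)
        else:
            symbols.append(text[i])
            i += 1
    return symbols

# Invented sample text; character perplexity drops once "the " is one symbol.
text = "the cat sat on the mat the end "
chars = list(text)
print(character_perplexity(UnigramModel(chars, set(chars)), chars, len(text)))

tokens = tokenize_with_chunk(text, "the ")
print(character_perplexity(UnigramModel(tokens, set(tokens)), tokens, len(text)))
```

The same arithmetic applies to the figures reported above: log_2 4.95 ≈ 2.31 bits per character against log_2 4.48 ≈ 2.16 bits per character, a reduction of roughly 6.2%, which matches the stated improvement in compression performance.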