<?xml version="1.0" standalone="yes"?> <Paper uid="J92-1002"> <Title>An Estimate of an Upper Bound for the Entropy of English</Title> <Section position="3" start_page="32" end_page="35" type="metho"> <SectionTitle> 3. The Language Model </SectionTitle> <Paragraph position="0"> In this section, we describe our language model. The model is very simple: it captures the structure of English only through token trigram frequencies. Roughly speaking, the model estimates the probability of a character sequence by dissecting the sequence into tokens and spaces and computing the probability of the corresponding token sequence. The situation is slightly more complicated than this since, for a fixed token vocabulary, some character sequences will not have any such dissection while others will have several. For example, the sequence abc xyz might not have any dissection, while the sequence bedrock might be dissected as one token or as two tokens without an intervening space.</Paragraph> <Paragraph position="1"> We address the difficulty of sequences that cannot be dissected by introducing an unknown token that can account for any spelling. We address the problem of multiple dissections by considering the token sequences to be hidden. The model generates a sequence of characters in four steps: 1. It generates a hidden string of tokens using a token trigram model. 2. It generates a spelling for each token.</Paragraph> <Paragraph position="2"> 3. It generates a case for each spelling.</Paragraph> <Paragraph position="3"> 4. It generates a spacing string to separate cased spellings from one another. The final character string consists of the cased spellings separated by the spacing strings.</Paragraph> <Paragraph position="4"> The probability of the character string is a sum over all of its dissections of the joint probability of the string and the dissection:</Paragraph> <Paragraph position="5"> M(character_string) = \sum_{dissections} M(character_string, dissection), </Paragraph> <Paragraph position="6"> where, following the four generative steps above, the joint probability factors as M(character_string, dissection) = M_token(token_string) M_spelling(spellings | tokens) M_case(cases | tokens) M_spacing(spacing_strings). (15) </Paragraph> <Section position="1" start_page="33" end_page="34" type="sub_section"> <SectionTitle> 3.1 The Token Trigram Model </SectionTitle> <Paragraph position="0"> The token trigram model is a second-order Markov model that generates a token string t_1 t_2 ... t_n by generating each token t_i, in turn, given the two previous tokens t_{i-1} and t_{i-2}. Thus the probability of a string is</Paragraph> <Paragraph position="1"> Pr(t_1 t_2 ... t_n) = \prod_{i=1}^{n} M_token(t_i | t_{i-2} t_{i-1}). </Paragraph> <Paragraph position="2"> The conditional probabilities M_token(t_3 | t_1 t_2) are modeled as a weighted average of four estimators f_i:</Paragraph> <Paragraph position="3"> M_token(t_3 | t_1 t_2) = \lambda_3(t_1 t_2) f_3(t_3 | t_1 t_2) + \lambda_2(t_1 t_2) f_2(t_3 | t_2) + \lambda_1(t_1 t_2) f_1(t_3) + \lambda_0(t_1 t_2) f_0(t_3), </Paragraph> <Paragraph position="4"> where the weights λ_i satisfy Σ_i λ_i = 1 and λ_i ≥ 0.</Paragraph> <Paragraph position="5"> The estimators f_i and the weights λ_i are determined from the training data using a procedure that is explained in detail by Jelinek and Mercer (1980). Basically, the training data are divided into a large, primary segment and a smaller, held-out segment. The estimators f_i are chosen to be the conditional frequencies in the primary segment, while the smoothing weights λ_i are chosen to fit the combined model to the held-out segment. In order to decrease the freedom in smoothing, the λ_i are constrained to depend on (t_1 t_2) only through the counts c(t_1 t_2) and c(t_2) in the primary training segment. When c(t_1 t_2) is large, we expect λ_3(t_1 t_2) to be close to 1, since in this case the trigram frequency in the primary segment should be a reliable estimate of the frequency in the held-out segment. Similarly, when c(t_1 t_2) is small but c(t_2) is large, we expect λ_3(t_1 t_2) to be close to 0 and λ_2(t_1 t_2) to be close to 1.</Paragraph>
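To make the interpolation concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, bucketing thresholds, and weight values are hypothetical, and the fourth estimator f_0 is assumed here to be uniform over the vocabulary.

from collections import Counter

def train_counts(tokens):
    """Collect unigram, bigram, and trigram counts from the primary training segment."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return unigrams, bigrams, trigrams

def pick_lambdas(c_t1t2, c_t2):
    """Toy bucketing of the smoothing weights by the counts c(t1 t2) and c(t2).
    In the paper the weights are fit to a held-out segment (Jelinek and Mercer 1980);
    the thresholds and values here are placeholders."""
    if c_t1t2 >= 100:
        return (0.70, 0.20, 0.09, 0.01)   # trust the trigram frequency
    if c_t2 >= 100:
        return (0.10, 0.60, 0.29, 0.01)   # fall back toward the bigram frequency
    return (0.05, 0.25, 0.60, 0.10)

def m_token(t3, t1, t2, unigrams, bigrams, trigrams, vocab_size, total_tokens):
    """Smoothed trigram probability: a weighted average of trigram, bigram, and
    unigram relative frequencies plus a uniform estimator over the vocabulary."""
    lam3, lam2, lam1, lam0 = pick_lambdas(bigrams[(t1, t2)], unigrams[t2])
    f3 = trigrams[(t1, t2, t3)] / bigrams[(t1, t2)] if bigrams[(t1, t2)] else 0.0
    f2 = bigrams[(t2, t3)] / unigrams[t2] if unigrams[t2] else 0.0
    f1 = unigrams[t3] / total_tokens
    f0 = 1.0 / vocab_size
    return lam3 * f3 + lam2 * f2 + lam1 * f1 + lam0 * f0

As in the constraint described above, the weights depend on (t_1 t_2) only through c(t_1 t_2) and c(t_2); in the paper they are fit to the held-out segment rather than hand-set.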
<Paragraph position="6"> The token vocabulary consists of 1. 293,181 spellings, including a separate entry for each punctuation character; 2. a special unknown_token that accounts for all other spellings; 3. a special sentence_boundary_token that separates sentences.</Paragraph> </Section> <Section position="2" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 3.2 The Spelling Model </SectionTitle> <Paragraph position="0"> The spelling model generates a spelling s_1 s_2 ... s_k given a token. For any token other than the unknown_token and sentence_boundary_token, the model generates the spelling of the token. For the sentence_boundary_token, the model generates the null string. Finally, for the unknown_token, the model generates a character string by first choosing a length k according to a Poisson distribution, and then choosing k characters independently and uniformly from the printable ASCII characters. Thus</Paragraph> <Paragraph position="1"> M_spelling(s_1 s_2 ... s_k | unknown_token) = \frac{\mu^k e^{-\mu}}{k!} \, p^k, </Paragraph> <Paragraph position="2"> where \mu is the mean of the Poisson distribution and p is the uniform probability of each printable ASCII character.</Paragraph> </Section> <Section position="3" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 3.3 The Case Model </SectionTitle> <Paragraph position="0"> The case model generates a cased spelling given a token, the spelling of the token, and the previous token. For the unknown_token and sentence_boundary_token, this cased spelling is the same as the spelling. For all other tokens, the cased spelling is obtained by modifying the uncased spelling to conform with one of the eight possible patterns L+, U+, UL+, ULUL+, ULLUL+, UUL+, UUUL+, LUL+. Here U denotes an uppercase letter, L a lowercase letter, U+ a sequence of one or more uppercase letters, and L+ a sequence of one or more lowercase letters. The case pattern only affects the 52 uppercase and lowercase letters.</Paragraph> <Paragraph position="1"> The case pattern C for a token t is generated by a model of the form:</Paragraph> <Paragraph position="3"> Here b is a bit that is 1 if the previous token is the sentence_boundary_token and is 0 otherwise. We use b to model capitalization at the beginning of sentences.</Paragraph> </Section>
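As an illustration of the case patterns of Section 3.3, the following Python sketch (hypothetical code, not from the paper) reduces a cased spelling to a signature of U's and L's and matches it against the eight patterns, treating them as regular expressions.

import re

# The eight case patterns of Section 3.3, written over a signature alphabet
# in which U stands for an uppercase letter and L for a lowercase letter.
CASE_PATTERNS = ["L+", "U+", "UL+", "ULUL+", "ULLUL+", "UUL+", "UUUL+", "LUL+"]

def case_signature(cased_spelling):
    """Map each letter to U or L; characters other than the 52 letters are
    ignored, since the case pattern only affects letters."""
    return "".join("U" if ch.isupper() else "L"
                   for ch in cased_spelling if ch.isalpha())

def case_pattern(cased_spelling):
    """Return the first of the eight patterns that the spelling conforms to,
    or None if none matches."""
    signature = case_signature(cased_spelling)
    for pattern in CASE_PATTERNS:
        if re.fullmatch(pattern, signature):
            return pattern
    return None

# Example: case_pattern("Bedrock") is "UL+", case_pattern("IBM") is "U+",
# and case_pattern("MacDonald") is "ULLUL+".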
<Section position="4" start_page="34" end_page="35" type="sub_section"> <SectionTitle> 3.4 The Spacing Model </SectionTitle> <Paragraph position="0"> The spacing model generates the spacing string between tokens, which is either null, a dash, an apostrophe, or one or more blanks. It is generated by an interpolated model similar to that in Equation (19). The actual spacing that appears between two tokens should depend on the identity of each token, but in our model we only consider the dependence on the second token. This simplifies the model, but still allows it to do a good job of predicting the null spacing that precedes many punctuation marks. For strings of blanks, the number of blanks is determined by a Poisson distribution.</Paragraph> </Section> <Section position="5" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 3.5 The Entropy Bound </SectionTitle> <Paragraph position="0"> According to the paradigm of Section 2.2 (see Equation (13)), we can estimate an upper bound on the entropy of characters in English by calculating the language model probability M(character_string) of a long string of English text. For a very long string it is impractical to calculate this probability exactly, since it involves a sum over the different hidden dissections of the string. However, for any particular dissection, M(character_string) > M(character_string, dissection). Moreover, for our model, a straightforward partition of a character string into tokens usually yields a dissection for which this inequality is approximately an equality. Thus we settle for the slightly less sharp bound</Paragraph> <Paragraph position="2"> H \leq -\frac{1}{N} \log_2 M(character_string, dissection), (20) </Paragraph> <Paragraph position="3"> where H denotes the entropy of characters in English, N is the number of characters in the string, and dissection is provided by a simple finite state tokenizer. By Equation (15), the joint probability M(character_string, dissection) is the product of four factors. Consequently, the upper bound estimate (20) is the sum of four entropies,</Paragraph> <Paragraph position="4"> -\frac{1}{N} \log_2 M(character_string, dissection) = -\frac{1}{N} \log_2 M_token(token_string) - \frac{1}{N} \log_2 M_spelling(spellings | tokens) - \frac{1}{N} \log_2 M_case(cases | tokens) - \frac{1}{N} \log_2 M_spacing(spacing_strings). </Paragraph>
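As a small illustration of how the bound of Section 3.5 decomposes, the Python sketch below (hypothetical; the function names and the dictionary of per-component log probabilities are assumptions) turns the total log2 probability assigned by each of the four model components into per-character entropies in bits and sums them.

def component_entropies_bits_per_char(log2_probs, num_characters):
    """Split the bound into the sum of four entropies, one per model component.
    log2_probs maps component names ('token', 'spelling', 'case', 'spacing')
    to the total log2 probability that component assigns to the test sample."""
    return {name: -lp / num_characters for name, lp in log2_probs.items()}

def entropy_bound_bits_per_char(log2_probs, num_characters):
    """Upper-bound estimate of the per-character entropy: the negative base-2
    log of the joint probability of the string and one fixed dissection,
    normalized by the number of characters in the string."""
    return sum(component_entropies_bits_per_char(log2_probs, num_characters).values())

In practice, the four totals would be accumulated while running the finite state tokenizer and the four component models over the test sample.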
</Section> </Section> <Section position="4" start_page="35" end_page="37" type="metho"> <SectionTitle> 4. The Data </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 4.1 The Test Sample </SectionTitle> <Paragraph position="0"> We used as a test sample the Brown Corpus of English text (Kucera and Francis 1967). This well-known corpus was designed to represent a wide range of styles and varieties of prose. It consists of samples from 500 documents, each of which first appeared in print in 1961. Each sample is about 2,000 tokens long, yielding a total of 1,014,312 tokens (according to the tokenization scheme used by Kucera and Francis [1967]).</Paragraph> <Paragraph position="1"> We used the Form C version of the Brown Corpus. Although in this version only proper names are capitalized, we modified the text by capitalizing the first letter of every sentence. We also discarded paragraph and segment delimiters.</Paragraph> </Section> <Section position="2" start_page="35" end_page="36" type="sub_section"> <SectionTitle> 4.2 The Training Data </SectionTitle> <Paragraph position="0"> We estimated the parameters of our language model from a training text of 583 million tokens drawn from 18 different sources. We emphasize that this training text does not include the test sample. The sources of training text are listed in Table 1 and include text from: 1. several newspaper and news magazine sources: the Associated Press; the United Press International (UPI); the Washington Post; and a collection of magazines published by Time Incorporated; 2. two encyclopedias: Grolier's Encyclopedia and the McGraw-Hill Encyclopedia of Science and Technology; 3. two literary sources: a collection of novels and magazine articles from the American Printing House for the Blind (APHB) and a collection of Sherlock Holmes novels and short stories; 4. several legal and legislative sources: the 1973-1986 proceedings of the Canadian parliament; a sample issue of the Congressional Record; and the depositions of a court case involving IBM; 5. office correspondence (OC) from IBM and from Amoco; 6. other miscellaneous sources: Bartlett's Familiar Quotations, the Chicago Manual of Style, and The World Almanac and Book of Facts.</Paragraph> </Section> <Section position="3" start_page="36" end_page="37" type="sub_section"> <SectionTitle> 4.3 The Token Vocabulary </SectionTitle> <Paragraph position="0"> We constructed the token vocabulary by taking the union of a number of lists, including: 1. two dictionaries; 2. two lists of first and last names: a list derived from the IBM on-line phone directory, and a list of names we purchased from a marketing company; 3. a list of place names derived from the 1980 U.S. census; 4. vocabulary lists used in IBM speech recognition and machine translation experiments.</Paragraph> <Paragraph position="1"> The resulting vocabulary contains 89.02% of the 44,177 distinct tokens in the Brown Corpus, and covers 99.09% of the 1,014,312-token text. The twenty most frequently occurring tokens in the Brown Corpus not contained in our vocabulary appear in Table 2. The first two, *J and *F, are codes used in the Brown Corpus to denote formulas and special symbols.</Paragraph> </Section> </Section> </Paper>