<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1026"> <Title>Entropy Rate Constancy in Text</Title> <Section position="4" start_page="2" end_page="4" type="metho"> <SectionTitle> 3 Problem Formulation </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> the corpus. Let us consider i to be xed. The random variable we are interested in is Y into two sets. The rst, which we call C</Paragraph> <Paragraph position="4"> , i.e. all the words from the preceding sentences. The remaining set, which we call L</Paragraph> <Paragraph position="6"> be empty sets. We can now write our variable</Paragraph> <Paragraph position="8"> stays constant for all i. By the de nition of relative mutual information between X</Paragraph> <Paragraph position="10"> where the last term is the mutual information between the word and context given the sentence. As i increases, so does the set C</Paragraph> <Paragraph position="12"> the other hand, increases until we reach the end of the sentence, and then becomes small again.</Paragraph> <Paragraph position="13"> Intuitively, we expect the mutual information at, say, word k of each sentence (where L i has the same size for all i) to increase as the sentence number is increasing. By our hypothesis we then expect H(X</Paragraph> <Paragraph position="15"> ) to increase with the sentence number as well.</Paragraph> <Paragraph position="16"> Current techniques are not very good at estimating H(Y i ), because we do not have a very good model of context, since this model must be mostly semantic in nature. We have shown, however, that if we can instead estimate</Paragraph> <Paragraph position="18"> ) and show that it increases with the sentence number, we will provide evidence to support the constancy rate principle.</Paragraph> <Paragraph position="19"> The latter expression is much easier to estimate, because it involves only words from the beginning of the sentence whose relationship is largely local and can be successfully captured through something as simple as an n-gram model.</Paragraph> <Paragraph position="20"> We are only interested in the mean value of</Paragraph> <Paragraph position="22"> sentence. This number is equal to</Paragraph> <Paragraph position="24"> which reduces the problem to the one of estimating the entropy of a sentence.</Paragraph> <Paragraph position="25"> We use three di erent ways to estimate the</Paragraph> <Paragraph position="27"> ) directly, using a non-parametric estimator. We estimate the entropy for the beginning of each sentence. This</Paragraph> <Paragraph position="29"> i.e. ignores not only the context, but also the local syntactic information.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4Results 4.1 N-gram </SectionTitle> <Paragraph position="0"> N-gram models make the simplifying assumption that the current word depends on a constant number of the preceding words (we use three). The probability model for sentence S thus looks as follows:</Paragraph> <Paragraph position="2"> To estimate the entropy of the sentence S,we compute log P(S). This is in fact an estimate of cross entropy between our model and true distribution. Thus we are overestimating the entropy, but if we assume that the overestimation error is more or less uniform, we should still see our estimate increase as the sentence number increases.</Paragraph> <Paragraph position="3"> Penn Treebank corpus (Marcus et al., 1993) sections 0-20 were used for training, sections 2124 for testing. 
<Paragraph position="2"> Each article was treated as a separate text, the results for each sentence number were grouped together, and the mean value is reported in Figure 1 (dashed line). Since most articles are short, fewer sentences are available for larger sentence numbers, so the results for large sentence numbers are less reliable.</Paragraph> <Paragraph position="3"> The trend is fairly obvious, especially for small sentence numbers: sentences (with no context used) get harder as the sentence number increases, i.e. the probability of the sentence given the model decreases.</Paragraph> </Section> <Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 4.2 Parser Model </SectionTitle> <Paragraph position="0"> We also computed the log-likelihood of the sentence using the statistical parser described in Charniak (2001). The probability model for sentence S with parse tree T is (roughly)

P(S) = \prod_{x \in S} P(x | parents(x)),

where parents(x) are the words which are parents of node x in the tree T. This model takes into account syntactic information present in the sentence which the previous model does not. (The parser does not proceed in a strictly left-to-right fashion, but this is not very important, since we estimate the entropy of the whole sentence rather than of individual words.)</Paragraph> <Paragraph position="1"> The entropy estimate is again -log P(S). Overall, these estimates are lower (closer to the true entropy) in this model, because the model is closer to the true probability distribution. The same corpus, training and testing sets were used. The results are reported in Figure 1 (solid line). The estimates are lower (better), but follow the same trend as the n-gram estimates.</Paragraph> </Section> <Section position="3" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 4.3 Non-parametric Estimator </SectionTitle> <Paragraph position="0"> Finally, we compute the entropy using the estimator described in (Kontoyiannis et al., 1998), which is based on the lengths of the longest matches between the test text and the training corpus. Let T be our training corpus and let S = w_1 ... w_n be the sentence whose entropy we wish to estimate. We compute such estimates for many first sentences, second sentences, etc., and take the average.</Paragraph> <Paragraph position="1"> For this experiment we used 3 million words of the Wall Street Journal (year 1988) as the training set and 23 million words (full year 1987) as the testing set. (This is not the same training set as the one used in the two previous experiments; for this experiment we needed a larger, but similar, data set.) The results are shown in Figure 2. They demonstrate the expected behavior, except for the strong abnormality at the second sentence. This abnormality is probably corpus-specific. For example, 1.5% of the second sentences in this corpus start with the words "the terms were not disclosed", which makes such sentences easy to predict and decreases the entropy.</Paragraph>
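The exact estimation formula did not survive extraction here, so the following is a minimal sketch of a match-length estimator in the spirit of Kontoyiannis et al. (1998); the normalization used below is an assumption for illustration, not necessarily the authors' exact procedure.

```python
# Sketch of a match-length entropy estimator (illustrative assumption): the
# longer the prefixes of the test sentence that already occur somewhere in the
# training corpus, the lower the resulting per-word entropy estimate.
import math

def longest_match_len(corpus, sentence, start):
    """Length of the longest prefix of sentence[start:] occurring contiguously
    in corpus (naive search, kept simple for clarity)."""
    best, n = 0, len(corpus)
    for j in range(n):
        k = 0
        while (start + k < len(sentence) and j + k < n
               and corpus[j + k] == sentence[start + k]):
            k += 1
        best = max(best, k)
    return best

def entropy_estimate(corpus, sentence):
    """Per-word entropy estimate (bits): n * log2 |T| / sum_i (match_i + 1)."""
    n = len(sentence)
    total = sum(longest_match_len(corpus, sentence, i) + 1 for i in range(n))
    return n * math.log2(len(corpus)) / total

# Average such estimates over all first sentences, all second sentences, etc.
corpus = "the terms of the deal were not disclosed by the company".split()
print(entropy_estimate(corpus, "the deal terms were not disclosed".split()))
```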
</Section> <Section position="4" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.4 Causes of Entropy Increase </SectionTitle> <Paragraph position="0"> We have shown that the entropy of a sentence (taken without context) tends to increase with the sentence number. We now examine the causes of this effect. These causes may be split into two categories: lexical (which words are used) and non-lexical (how the words are used). If the effects are entirely lexical, we would expect the per-word entropy of the closed-class words not to increase with the sentence number, since presumably the same set of words gets used in each sentence.</Paragraph> <Paragraph position="1"> For this experiment we use our n-gram estimator as described in Section 4.1. We evaluate the per-word entropy for nouns, verbs, determiners, and prepositions. The results are given in Figure 3 (solid lines). They indicate that the entropy of the closed-class words increases with the sentence number, which presumably means that non-lexical effects (e.g. usage) are present.</Paragraph> <Paragraph position="2"> We also want to check for the presence of lexical effects. It has been shown by Kuhn and de Mori (1990) that lexical effects can be easily captured by caching. In its simplest form, caching involves keeping track of the words occurring in the previous sentences and assigning to each word w a caching probability

P_cache(w) = C(w) / \sum_{w'} C(w'),

where C(w) is the number of times w occurs in the previous sentences. This probability is then mixed with the regular probability (in our case, the n-gram probability):

P_new(w | h) = \lambda P_cache(w) + (1 - \lambda) P(w | h),

where h is the n-gram history of w and \lambda was picked to be 0.1. This new probability model is known to have lower entropy. More complex caching techniques are possible (Goodman, 2001), but are not necessary for this experiment.</Paragraph> <Paragraph position="3"> Thus, if lexical effects are present, we expect the model that uses caching to provide lower entropy estimates. The results are given in Figure 3 (dashed lines). We can see that caching gives a significant improvement for nouns and a small one for verbs, but gives no improvement for the closed-class parts of speech. This shows that lexical effects are present for the open-class parts of speech and (as we assumed in the previous experiment) are absent for the closed-class parts of speech. Since we have demonstrated the presence of non-lexical effects in the previous experiment, we can conclude that both lexical and non-lexical effects are present.</Paragraph> </Section> </Section> </Paper>
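As an illustration of the mixing step just described, here is a minimal sketch; the class and function names and the dummy base model are assumptions introduced for the example, while the interpolation formula and lambda = 0.1 come from the text.

```python
# Sketch of the caching mixture (illustrative; names and base model are assumed).
from collections import Counter

LAMBDA = 0.1   # mixing weight, as in the text

class CacheMixer:
    def __init__(self, base_prob, lam=LAMBDA):
        self.base_prob = base_prob    # function: (word, history) -> probability
        self.lam = lam
        self.cache = Counter()        # counts C(w) over the previous sentences

    def prob(self, word, history):
        """P_new(w | h) = lam * P_cache(w) + (1 - lam) * P(w | h)."""
        total = sum(self.cache.values())
        p_cache = self.cache[word] / total if total else 0.0
        return self.lam * p_cache + (1.0 - self.lam) * self.base_prob(word, history)

    def update(self, sentence):
        """Add a finished sentence's words to the cache before the next sentence."""
        self.cache.update(sentence)

# Usage with a dummy base model standing in for the n-gram probability; in the
# experiment the cache could be restricted to one part of speech (e.g. nouns).
mixer = CacheMixer(lambda w, h: 1e-4)
mixer.update(["terms", "were", "not", "disclosed"])
print(mixer.prob("terms", ("the",)))
```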