<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1217">
  <Title>I</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Entropy Estimation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Background
</SectionTitle>
      <Paragraph position="0"> English, as is well-known, is very predictable. Fluent English readers can confirm this for themselves by guessing which letter comes next in a word be~aming psyc-. Experiments by (Shannon, 1951) indicate that most readers can guess more than half of the letters in running text based on their expert knowledge of the lexicon, structure, and semantics of English.</Paragraph>
      <Paragraph position="1"> This notion of predictability, as well as the associated concepts of complexity, compressiveness, and randomness, can be mathematically modelled using information entropy. As developed by (Shannon,  1948), the entropy of a (stationary, ergodic) message source is the amount of information, typically measured in bits (yes/no questions), required to describe the successive messages emitted by that source to a recipient. As the set of possible messages becomes larger, or the distribution of messages becomes less predictable, the entropy of the source increases correspondingly, in accordance with Shannon's equation: null</Paragraph>
      <Paragraph position="3"> where P is (the probability distribution of) a source capable of sending any of the messages 1, 2,..., N, each with some probability Pi. (For continuous distributions, simply replace the summation with the appropriate integral.) An important aspect of this brief description has significant typological and taxonomic implications.</Paragraph>
      <Paragraph position="4"> Against what is the predictability of the distribution measured? The second term in the above equation is a measure of the efficiency of the representation of message i (obviously, more frequent messages should be made shorter for maximal efficiency, an observation often attributed to Zipf), based on our estimate of the frequency with which i is transmitted. Therefore, we can generalize equation 1 to</Paragraph>
      <Paragraph position="6"> where Q is a different distribution representing our best estimate of the true distribution P. This value (called the cross-entropy) achieves a minimum when P = Q, and H(P, P) = H(P). The difference between/:/and H , the so-called Kullback-Leibler divergence, can be taken as a measurement of the degree of similarity between P and Q.1 For further elaboration on this point, the reader is referred to the excellent treatment in (Bishop, 1995).</Paragraph>
      <Paragraph position="7"> This technique lends itself to a measurement of similarity between two different sources, by estimating the distributional parameters and calculating their cross-entropy.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Method
</SectionTitle>
      <Paragraph position="0"> Obviously, much research has been done in the proper development of distributional models of English (or other languages) and in the efficient estimation of the probability distribution; (Brown et al., 1N.b. this is not a &amp;quot;distance metric&amp;quot; in the formal sense of the word (it's not symmetric, for one thing), but can be thought of as a distance for these purposes.</Paragraph>
      <Paragraph position="1"> 1992) calculate the entropy of a statistical model of English that was produced by training a computer on literally billions of observations comprising a huge corpus of written English. (Wyner, in press) has suggested that one can determine the entropy to nearly as good accuracy based on much smaller sample sizes, but it remains an open research question how much text is actually needed. At billions of observations per test, it is obviously impractical to determine document-level properties (such as, for instance, authorship, register, difficulty of reading, or even the language in which a novel document is written), but if the tests can be made sufficiently sensitive to work with small texts, tests like this may be practical.</Paragraph>
      <Paragraph position="2"> (Farach et al., 1995; Wyner, in press) describe a novel algorithm for entropy estimation for which they claim very fast convergence time; using no more than about five pages of text, they can achieve nearly the same accuracy as (Brown et al., 1992). The heart of this technique is a measurement of &amp;quot;match length within a database.&amp;quot; Wyner defines the match length Ln(x) of a sequence (xl, x2,..., xn, xn+x,...) as the length of the the longest prefix of the sequence (xn+x,...) that matches a contiguous substring of (zl,z2,... ,xn), and proves that this converges in the limit to the value ~ as n increases.</Paragraph>
      <Paragraph position="3"> A simple example should make this more clear : we consider for a moment the phrase</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>