File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/w06-3205_metho.xml
Size: 18,983 bytes
Last Modified: 2025-10-06 14:10:58
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3205"> <Title>Exploring variant definitions of pointer length in MDL</Title> <Section position="3" start_page="32" end_page="35" type="metho"> <SectionTitle> 2 Variant definitions of pointers </SectionTitle> <Paragraph position="0"> In order to simplify the following theoretical discussion, we temporarily abstract away from the complexity of a full-blown model of morphology. Given a set of N stems and their distribution, we consider the general problem of pointing to a subset of M stems (with 0 &lt; M ≤ N), first by means of &quot;standard&quot; lists of pointers, then by means of polarized ones, and finally by means of binary strings.</Paragraph> <Section position="1" start_page="32" end_page="33" type="sub_section"> <SectionTitle> 2.1 Expected length of lists of pointers </SectionTitle> <Paragraph position="0"> Let τ denote a set of N stems; we assume that the length of a pointer to a specific stem t ∈ τ is its inverse log probability −log pr(t).² Now, let {M} denote the set of all subsets of τ that contain exactly M stems (0 &lt; M ≤ N). The description length of a list of pointers to a particular subset u ∈ {M} is defined as the sum of the lengths of the M pointers it contains, plus a small cost for specifying the list structure itself, defined as l(M) := 0 if M = 0 and l(M) := log M otherwise.³</Paragraph> <Paragraph position="2"> The expected length of a pointer is equal to the entropy over the distribution of stems: h_stems := −Σ_{t ∈ τ} pr(t) log pr(t)</Paragraph> <Paragraph position="4"> Thus, the expected description length of a list of pointers to M stems (over all subsets u ∈ {M}) is: l(M) + M · h_stems (1)</Paragraph> <Paragraph position="6"> This value increases as a function of both the number of stems which are pointed to and the entropy over the distribution of stems. Since 0 ≤ h_stems ≤
log N, the following bounds hold: l(M) ≤ l(M) + M · h_stems ≤ l(M) + M log N</Paragraph> <Paragraph position="8"> ² Here and throughout the paper, we use the notation log x to refer to the binary logarithm of x; thus entropy and other information-theoretic quantities are expressed in bits.</Paragraph> <Paragraph position="9"> ³ Cases where the argument of this function can have the value 0 will arise in the next section.</Paragraph> </Section> <Section position="2" start_page="33" end_page="34" type="sub_section"> <SectionTitle> 2.2 Polarization </SectionTitle> <Paragraph position="0"> Consider a set of N = 3 equiprobable stems, and suppose that we need to specify that a given morpho-phonological rule applies to one of them. In this context, a list with a single pointer to a stem requires log 1 − log(1/3) = 1.58 bits. Suppose now that the rule is more general and applies to two of the three stems. The length of the new list of pointers is thus log 2 − 2 log(1/3) = 4.17 bits. It appears that for such a general rule, it is more compact to list the stems to which it does not apply, and mark the list with a flag that indicates the &quot;negative&quot; meaning of the pointers. Since the flag signals a binary choice (either the list points to stems that undergo the rule, or to those that do not), log 2 = 1 bit suffices to encode it, so that the length of the new list is 1.58 + 1 = 2.58 bits.</Paragraph> <Paragraph position="1"> We propose to use the term polarized to refer to lists of pointers bearing such a flag. If it is useful to distinguish between specific settings of the flag, we may speak of positive versus negative lists of pointers (the latter being the case of our last example).
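The arithmetic of the three-stem example can be checked with a minimal Python sketch. It assumes that the list-structure cost is l(M) = log M for M ≥ 1 (which is what the figures 1.58 and 4.17 imply); the function names are illustrative only and do not come from any existing implementation.

```python
from math import log2

def list_structure_cost(m):
    """l(M): 0 bits for an empty list, log M bits otherwise (assumed form)."""
    return log2(m) if m else 0.0

def list_length(probs):
    """Bits needed for a plain list of pointers to stems with these probabilities."""
    return list_structure_cost(len(probs)) + sum(-log2(p) for p in probs)

def polarized_length(probs_listed):
    """One flag bit plus the list that is actually written out."""
    return 1.0 + list_length(probs_listed)

p = 1 / 3                         # three equiprobable stems
one = list_length([p])            # rule applies to one stem
two = list_length([p, p])         # rule applies to two stems
negative = polarized_length([p])  # negative list naming the one exception
print(round(one, 2), round(two, 2), round(negative, 2))  # 1.58 4.17 2.58
```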
The expected description length of a polarized list of M pointers is: 1 + l(min(M, N − M)) + min(M, N − M) · h_stems (2)</Paragraph> <Paragraph position="3"> From (1) and (2), we find that in general, the expected gain in description length by polarizing a list of M pointers is: −1 if M ≤ N/2, and −1 + l(M) − l(N − M) + (2M − N) · h_stems otherwise.</Paragraph> <Paragraph position="5"> Thus, if the number of stems pointed to is less than or equal to half the total number of stems, using a polarized list rather than a non-polarized one means wasting exactly 1 bit for encoding the superfluous flag. If the number of stems pointed to is larger than that, we still pay 1 bit for the flag, but the reduced number of pointers results in an expected saving of l(M) − l(N − M) bits for the list structure, plus (2M − N) · h_stems bits for the pointers themselves.</Paragraph> <Paragraph position="6"> Now, let us assume that we have no information regarding the number M of elements which are pointed to, i.e. that it has a uniform distribution between 1 and N (M ∼ U[1, N]). Let us further assume that stems follow a Zipfian distribution of parameter s, so that the probability of the k-th most frequent stem is defined as: pr(k) := 1 / (k^s · H_{N,s}),</Paragraph> <Paragraph position="8"> where H_{N,s} stands for the harmonic number of order N of s, i.e. the sum of 1/n^s for n = 1, ..., N. The entropy over this distribution is: (s / H_{N,s}) · Σ_{k=1}^{N} (log k) / k^s + log H_{N,s}</Paragraph> <Paragraph position="10"> Armed with these assumptions, we may now compute the expected description length gain of polarization (over all values of M) as a function of N and s, by averaging the gain above over M = 1, ..., N with the Zipfian entropy substituted for h_stems.</Paragraph> <Paragraph position="12"> Figure 1 shows the gain calculated for N = 1, 400, 800, 1200, 1600 and 2000, and s = 0, 1, 2 and 10. In general, it increases with N, with a slope that depends on s: the greater the value of s, the lower the entropy over the distribution of stems; since the entropy corresponds to the expected length of a pointer, its decrease entails a decrease in the number of bits that can be saved by using polarized lists (which generally use fewer pointers).
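A rough numerical sketch of this computation is given here in Python. It assumes l(M) = log M for M ≥ 1 and l(0) = 0, a uniform distribution of M over 1, ..., N, and the per-M gain stated above; the function names are hypothetical, and this is not the code that produced Figure 1.

```python
from math import log2

def zipf_probs(n, s):
    """pr(k) = 1 / (k**s * H) for k = 1..n, with H the generalized harmonic number."""
    weights = [k ** -s for k in range(1, n + 1)]
    h = sum(weights)
    return [w / h for w in weights]

def entropy(probs):
    """Entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p)

def list_cost(m):
    """Assumed list-structure cost l(M)."""
    return log2(m) if m else 0.0

def polarization_gain(m, n, h_stems):
    """Expected bits saved by a polarized list over a plain one, for a given M."""
    if 2 * m > n:
        return -1.0 + list_cost(m) - list_cost(n - m) + (2 * m - n) * h_stems
    return -1.0  # the flag is wasted when at most half of the stems are listed

def expected_gain(n, s):
    """Average of the per-M gain over M = 1..n (uniform M, Zipfian stems)."""
    h_stems = entropy(zipf_probs(n, s))
    return sum(polarization_gain(m, n, h_stems) for m in range(1, n + 1)) / n

# Probability of the most frequent stem for a very skewed distribution.
print(round(zipf_probs(2000, 10)[0], 3))  # 0.999
```

Calling expected_gain on the grid of N and s values mentioned above gives a way to explore how the gain varies with both parameters under these assumptions.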
However, even for an aberrantly skewed distribution of stems,⁴ the expected gain of polarization remains positive.</Paragraph> <Paragraph position="13"> Since the value of s is usually taken to be slightly greater than 1 for natural languages (Mandelbrot, 1953), it seems that polarized lists generally entail a considerable gain in description length.</Paragraph> </Section> <Section position="3" start_page="34" end_page="35" type="sub_section"> <SectionTitle> 2.3 Binary strings </SectionTitle> <Paragraph position="0"> Consider again the problem of pointing to one out of three equiprobable stems. Suppose that the list of stems is ordered, and that we want to point to the first one, for instance. An alternative to resorting to a list of pointers consists in using a binary string (in this case 100) where the i-th symbol is set to 1 (or +) if the i-th stem is being pointed to, and to 0 (or -) otherwise. Figure 2 gives a schematic view of these two ways of pointing to items.</Paragraph> <Paragraph position="1"> There are two main differences between this method and the previous one. On the one hand, the number of symbols in the string is constant and equal to the total number N of stems, regardless of the number M of stems that are pointed to. On the other hand, the compressed length of the string depends on the distribution of symbols in it, and not on the distribution of stems. Thus, by comparison with the description length of a list of pointers, there is a loss due to the larger number of encoded symbols, and a gain due to the use of an encoding specifically tailored for the relevant distribution of pointed versus &quot;unpointed&quot; elements.</Paragraph> <Paragraph position="2"> ⁴ In the case s = 10, the probability of the most frequent stem is .999 for N = 2000.</Paragraph> <Paragraph position="3"> The entropy associated with a binary string is entirely determined by the number of 1's it contains, i.e.
the number M of stems which are pointed to, and the length N of the string: −M log(M/N) − (N − M) log((N − M)/N), with the convention 0 log 0 = 0 (3)</Paragraph> <Paragraph position="5"> It is maximal and equal to N bits when M = N/2, and minimal and equal to 0 when M = N, i.e. when all stems have a pointer on them. Notice that binary strings are intrinsically polarized, so that swapping 0's and 1's results in the same description length regardless of their distribution.⁵ The question naturally arises under which conditions binary strings would be more or less compact than polarized lists of pointers. If we assume again that the distribution of the number of elements pointed to is uniform and the distribution of stems is Zipfian of parameter s, (2) and (3) justify computing the expected description length gain by using binary strings rather than polarized lists (as a function of N and s) as the average, over M = 1, ..., N, of (2) minus (3), with the Zipfian entropy substituted for h_stems.</Paragraph> <Paragraph position="7"> Figure 3 shows the gain calculated for N = 1, 400, 800, 1200, 1600 and 2000, and s = 0, 1, 2 and 3.</Paragraph> <Paragraph position="8"> For small s, i.e. when the entropy over the distribution of stems is greater or not much lower than that of natural languages, the description length of binary strings is considerably lower than that of polarized lists. The difference decreases as s increases, until at some point (around s = 2), the situation reverses and polarized lists become more compact. In both cases, the trend increases with the number N of stems (within the range of values observed).</Paragraph> <Paragraph position="9"> Figure 3 (caption fragment): ... using binary strings rather than polarized lists under the assumption that M ∼ U[1, N].</Paragraph> <Paragraph position="10"> ⁵ As one of the reviewers has indicated to us, the binary strings approach is actually very similar to the method of combinatorial codes described by Rissanen (1989). This method consists in pointing to one among C(N, M) = N! / (M! (N − M)!) possible combinations of M stems out of N. Under the assumption that these combinations have a uniform probability, the cost for pointing to M stems is log C(N, M) bits, which is in general slightly less than the description length of the corresponding binary string (the difference being maximal for M = N/2, i.e. when the binary string encoding cannot take advantage of any compression).</Paragraph> <Paragraph position="11"> By contrast, it is instructive to consider a case where the distribution of the number of elements pointed to departs from uniformity. For instance, we can make the assumption that M follows a binomial distribution (M ∼ B[N, p]).⁶ Under this assumption (and, as always, that of a Zipfian distribution of stems), the expected description length gain by using binary strings rather than polarized lists is obtained by weighting the difference between (2) and (3) by the binomial probability pr(M) of each value of M.</Paragraph> <Paragraph position="13"> Letting N and s vary as in the previous computation, we set the probability for a stem to have a pointer on it to p = 0.01, so that the distribution of pointed versus &quot;unpointed&quot; elements is considerably skewed.⁷ ⁶ This model predicts that most of the time, the number M of elements pointed to is equal to N · p (where p denotes the probability for a stem to have a pointer on it), and that the probability pr(M) of other values of M decreases as they diverge from N · p.</Paragraph> <Paragraph position="14"> Figure 4 (caption fragment): ... using binary strings rather than polarized lists under the assumption that M ∼
B[N, 0.01].</Paragraph> <Paragraph position="15"> As shown in figure 4, under these conditions, the absolute value of the gain of using binary strings gets much smaller in general, and the value of s for which the gain becomes negative for large N gets close to 1 (for this particular value, it becomes positive at some point between N = 1200 and N = 1600).</Paragraph> <Paragraph position="16"> Altogether, under the assumptions that we have used, these theoretical considerations suggest that binary strings generally yield shorter description lengths than polarized lists of pointers. Of course, data for which these assumptions do not hold could arise. From the perspective of unsupervised learning, it would be particularly interesting to observe that such data drive the learner to induce a different model depending on the representation of pointers being adopted.</Paragraph> <Paragraph position="17"> It should be noted that nothing prevents binary strings and lists of pointers from coexisting in a single system, which would select the most compact one for each particular case. On the other hand, it is a logical necessity that all lists of pointers be of the same kind, either polarized or not.</Paragraph> </Section> </Section> <Section position="4" start_page="35" end_page="38" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> In the previous section, by assuming frequencies of stems and possible distributions of M (the number of stems per signature), we have explored theoretically the differences between several encoding methods in the MDL framework. In this section, we apply these methods to the problem of suffix discovery in natural language corpora, in order to verify the theoretical predictions we made previously. Thus, the purpose of these experiments is not to state that one encoding is preferable to the others; rather, we want to answer the following three questions: 1.
Are our assumptions on the frequency of stems and size of signatures appropriate for natural language corpora? 2. Given these assumptions, do our theoretical analyses correctly predict the difference in description length of two encodings? 3. What is the relationship between the gain in description length and the size of the corpus?</Paragraph> <Section position="1" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 3.1 Experimental methodology </SectionTitle> <Paragraph position="0"> In this experiment, for the purpose of calculating distinct description lengths while using different encoding methods, we modified Linguistica⁸ by implementing lists of pointers and binary strings as alternative means to encode the pointers from signatures to their associated stems.⁹ As a result, given a set of signatures, we are able to compute a description length for each encoding method.</Paragraph> <Paragraph position="1"> Within Linguistica, the morphology learning process can be divided into a sequence of heuristics, each of which searches for possible incremental modifications to the current morphology. For example, in the suffix-discovery procedure, ten heuristics are carried out successively; thus, we have a distinct set of signatures after applying each of the ten heuristics. Then, for each of these sets, we encode the pointers from each signature to its corresponding stems in three rival ways: as a list of pointers (polarized or not), as traditionally understood, and as a binary string. This way, we can compute the total description length of the signature-stem-linkage for each of the ten sets of signatures and for each of the three ways of encoding the pointers. We also collect statistics on word frequencies and on the distribution of the size of signatures M, i.e.
the number M of stems which are pointed to, both of which are important parametric components in our theoretical analysis.</Paragraph> <Paragraph position="2"> Experiments are carried out on two orthographic corpora (English and French), each of which has 100,000 word tokens.</Paragraph> </Section> <Section position="2" start_page="36" end_page="37" type="sub_section"> <SectionTitle> 3.2 Frequency of stems and size of signatures </SectionTitle> <Paragraph position="0"> The frequency of stems as a function of their rank and the distribution of the size of signatures are plotted in the accompanying figures, which show that in both the English and the French corpora, stems appear to have a distribution similar to a Zipfian one. In addition, in both corpora, M follows a distribution whose character we are not sure of, but which appears more similar to a binomial distribution. To some extent, these observations are consistent with the assumptions we made in the previous theoretical analysis.</Paragraph> </Section> <Section position="3" start_page="37" end_page="37" type="sub_section"> <SectionTitle> 3.3 Description length of each encoding </SectionTitle> <Paragraph position="0"> The description length obtained with each encoding method is displayed in figures 9 (English corpus) and 10 (French corpus), in which the x-axis refers to the set of signatures resulting from the application of each successive heuristic, and the y-axis corresponds to the description length in bits. Note that we only plot description lengths of non-polarized lists of pointers, because the number of stems per signature is always less than half the total number of stems in these data (and we expect that this would be true for other languages as well).¹⁰ These two plots show that in both corpora, there is always a gain in description length by using binary strings rather than lists of pointers for encoding the pointers from signatures to stems.
This observation is consistent with our conclusion in section 2.3, but it is important to emphasize again that for other data (or other applications), lists of pointers might turn out to be more compact.</Paragraph> <Paragraph position="1"> ¹⁰ See figures 6 and 8 as well as section 2.2 above.</Paragraph> </Section> <Section position="4" start_page="37" end_page="37" type="sub_section"> <SectionTitle> 3.4 Description length gain as a function of corpus size </SectionTitle> <Paragraph position="0"> In order to evaluate the effect of corpus size on the gain in description length by using binary strings rather than lists of variable-length pointers, we applied Linguistica to a number of English corpora of different sizes ranging from 5,000 to 200,000 tokens. For the final set of signatures obtained with each corpus, we then compute the gain of binary strings encoding over lists of pointers as we did in the previous experiments. The results are plotted in figure 11.</Paragraph> <Paragraph position="1"> This graph shows a strong positive correlation between description length gain and corpus size. This is reminiscent of the results of our theoretical simulations displayed in figures 3 and 4. As before, we interpret the match between the experimental results and the theoretical expectations as evidence supporting the validity of our theoretical predictions.</Paragraph> </Section> <Section position="5" start_page="37" end_page="38" type="sub_section"> <SectionTitle> 3.5 Discussion of experiments </SectionTitle> <Paragraph position="0"> These experiments are actually a number of case studies, in which we verify the applicability of our theoretical analysis on variant definitions of pointer lengths in the MDL framework. For the particular application we considered, learning morphology with Linguistica, binary strings encoding proves to be more compact than lists of variable-length pointers. However, the purpose of this paper is not to predict that one variant is always better, but rather to explore the mathematics behind different encodings. Armed with the mathematical analysis of different encodings, we hope to be better able to make the right choice under specific conditions. In particular, in the suffix-discovery application (and for the languages we examined), our results are consistent with the assumptions we made and the predictions we derived from them.</Paragraph> <Paragraph position="1"> Figure caption fragment: ... morphologies using pointers versus binary strings (English corpus).</Paragraph> </Section> </Section> </Paper>