<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3205"> <Title>Exploring variant definitions of pointer length in MDL</Title> <Section position="2" start_page="0" end_page="32" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The fundamental hypothesis underlying the Minimum Description Length (MDL) framework (Rissanen, 1989; de Marcken, 1996; Goldsmith, 2001) is that the selection of a model for explaining a set of data should aim at satisfying two constraints: on the one hand, it is desirable to select a model that can be described in a highly compact fashion; on the other hand, the selected model should make it possible to model the data well, which is interpreted as being able to describe the data in a maximally compact fashion. In order to turn this principle into an operational procedure, it is necessary to make explicit the notion of compactness. This is not a trivial problem, as the compactness (or conversely, the length) of a description depends not only on the complexity of the object being described (in this case, either a model or a set of data given a model), but also on the &quot;language&quot; that is used for the description. Consider, for instance, the model of morphology described in Goldsmith (2001). In this work, the data consist of a (symbolically transcribed) corpus segmented into words, and the &quot;language&quot; used to describe the data contains essentially three objects: a list of stems, a list of suffixes, and a list of signatures, i.e. structures specifying which stems associate with which suffixes to form the words found in the corpus. The length of a particular model (or morphology) is defined as the sum of the lengths of the three lists that compose it; the length of each list is in turn defined as the sum of the lengths of the elements in it, plus a small cost for the list structure itself. 
The length of an individual morpheme (stem or suffix) is taken to be proportional to the number of symbols in it.</Paragraph> <Paragraph position="1"> Calculating the length of a signature involves the notion of pointer, with which this paper is primarily concerned. The function of a signature is to relate a number of stems with a number of suffixes.</Paragraph> <Paragraph position="2"> Since each of these morphemes is spelled once in the corresponding list, there is no need to spell it again in a signature that contains it. Rather, each signature comprises a list of pointers to stems and a list of pointers to suffixes. A pointer is a symbol that stands for a particular morpheme, and the recourse to pointers relies on the assumption that their length is less than that of the morphemes they replace. Following information-theoretic principles (Shannon, 1948), the length of a pointer to a morpheme (under some optimal encoding scheme) is equal to -1 times the binary logarithm of that morpheme's probability. The length of a signature is the sum of the lengths of the two lists it contains, and the length of each list is the sum of the lengths of the pointers it contains (plus a small cost for the list itself).</Paragraph> <Paragraph position="3"> This work and related approaches to unsupervised language learning have assumed that there is only one way in which items could be pointed to, or identified. The purpose of this paper is to describe, compare and evaluate several different methods, each of which satisfies MDL's basic requirements, but which have different consequences for the treatment of linguistic phenomena. 
On the one hand, we contrast the expected description length of &quot;standard&quot; lists of pointers with polarized lists of pointers, which are specified as either (i) pointing to the relevant morphemes (those that belong to a signature, or undergo a morpho-phonological rule, for instance) or (ii) pointing to their complement (those that do not belong to a signature, or do not undergo a rule). On the other hand, we compare (polarized) lists of pointers with a method based on binary strings specifying each morpheme as relevant or not (for a given signature, rule, etc.). In particular, we discuss the conditions under which these different ways of pointing are expected to yield more compact descriptions of the data.</Paragraph> <Paragraph position="4"> The remainder of this paper is organized as follows. In the next section, we give a formal review of the standard treatment of lists of pointers as described in Goldsmith (2001); then we successively introduce polarized lists of pointers and the method of binary strings, and make a first, theoretical comparison of them. Section three is devoted to an empirical comparison of these methods on a large natural language corpus. In conclusion, we discuss the implications of our results in the broader context of unsupervised language learning.</Paragraph> </Section> </Paper>