<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-2001">
  <Title>Unsupervised Learning of the Morphology of a Natural Language</Title>
  <Section position="3" start_page="154" end_page="156" type="metho">
    <SectionTitle>
4 In addition, one would like a statement of general rules of allomorphy as well; for example, a
</SectionTitle>
    <Paragraph position="0"> statement that the stems hit and hitt (as in hits and hitting, respectively) are forms of the same linguistic stem. In an earlier version of this paper, we discussed a practical method for achieving this. The work is currently under considerable revision, and we will leave the reporting on this aspect of the problem to a later paper; there is a very brief discussion below. 5 The executable is available at http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000, along with instructions for use. The functions described in this paper can be incrementally applied to a corpus by the user of Linguistica.</Paragraph>
    <Paragraph position="1">  Computational Linguistics Volume 27, Number 2 corpus of 500,000 words in English requires about five minutes on a Pentium II 333.</Paragraph>
    <Paragraph position="2"> Perfectly respectable results can be obtained from corpora as small as 5,000 words.</Paragraph>
    <Paragraph position="3"> The system has been tested on corpora in English, French, German, Spanish, Italian, Dutch, Latin, and Russian; some quantitative results are reported below. The corpora that serve as its input are largely materials that have been obtained over the Internet, and I have endeavored to make no editorial changes to the files that are the input.</Paragraph>
    <Paragraph position="4"> In this paper, I will discuss prior work in this area (Section 2), the nature of the MDL model we propose (Section 3), heuristics for the task of the initial splitting of words into stem and affix (Section 4), the resulting signatures (Section 5), use of MDL to search the space of morphologies (Section 6), results (Section 7), the identification of entirely spurious generalizations (section 8), the grouping of signatures into larger units (Section 9), and directions for further improvements (Section 10). Finally, I will offer some speculative observations about the larger perspective that this work suggests and work in progress (Section 11).</Paragraph>
    <Paragraph position="5"> 2. Previous Research in this Area The task of automatic word analysis has intrigued workers in a range of disciplines, and the practical and theoretical goals that have driven them have varied considerably. Some, like Zellig Harris (and the present writer), view the task as an essential one in defining the nature of the linguistic analysis. But workers in the area of data compression, dictionary construction, and information retrieval have all contributed to the literature on automatic morphological analysis. (As noted earlier, our primary concern here is with morphology and not with regular allomorphy or morphophonology, which is the study of the changes in the realization of a given morpheme that are dependent on the grammatical context in which it appears, an area occasionally confused for morphology. Several researchers have explored the morphophonologies of natural language in the context of two-level systems in the style of the model developed by Kimmo Koskenniemi \[1983\], Lauri Karttunen \[1993\], and others.) The only general review of work in this area that I am aware of is found in Langer (1991), which is ten years old and unpublished.</Paragraph>
    <Paragraph position="6"> Work in automatic morphological analysis can be usefully divided into four major approaches. The first approach proposes to identify morpheme boundaries first, and thus indirectly to identify morphemes, on the basis of the degree of predictability of the n + 1st letter given the first n letters (or the mirror-image measure). This was first proposed by Zellig Harris (1955, 1967), and further developed by others, notably by Hafer and Weiss (1974). The second approach seeks to identify bigrams (and trigrams) that have a high likelihood of being morpheme internal, a view pursued in work discussed below by Klenk, Langer, and others. The third approach focuses on the discovery of patterns (we might say, of rules) of phonological relationships between pairs of related words. The fourth approach, which includes that used in this paper, is top-down, and seeks an analysis that is globally most concise. In this section, we shall review some of the work that has pursued these approaches--briefly, necessarily. 6 While not all of the approaches discussed here use no prior language-particular knowledge (which is the goal of the present system), I exclude from discussions those systems that are based essentially on a prior human-designed analysis of the grammatical morphemes of a language, aiming at identifying the stem(s) and the correct parsing; such is the 6 Another effort is that attributed to Andreev (1965) and discussed in Altmann and Lehfeldt (1980), especially p. 195 and following, though their description does not facilitate establishing a comparison with the present approach.</Paragraph>
    <Paragraph position="7">  Goldsmith Unsupervised Learning of the Morphology of a Natural Language case, for example, in Pacak and Pratt (1976), Koch, K~stner, and Riidiger (1989), and Wothke and Schrnidt (1992). With the exception of Harris's algorithm, the complexity of the algorithms is such as to make implementation for purposes of comparison prohibitively time-consuming.</Paragraph>
    <Paragraph position="8"> At the heart of the first approach, due to Harris, is the desire to place boundaries between letters (respectively, phonemes) in a word based on conditional entropy, in the following sense. We construct a device that generates a finite list of words, our corpus, letter by letter and with uniform probability, in such a way that at any point in its generation (having generated the first n letters 111213 * * * In) we can inquire of it what the entropy is of the set consisting of the next letter of all the continuations it might make. (In current parlance, we would most naturally think of this as a path from the root of a trie to one of its terminals, inquiring of each node its associated one-letter entropy, based on the continuations from that node.) Let us refer to this as the prefix conditional entropy; clearly we may be equally interested in constructing a trie from the right edge of words, which then provides us with a suffix conditional entropy, in mirror-image fashion.</Paragraph>
    <Paragraph position="9"> Harris himself employed no probabilistic notions, and the inclusion of entropy in the formulation had to await Hafer and Weiss (1974); but allowing ourselves the anachronism, we may say that Harris proposed that local peaks of prefix (and suffix) conditional entropy should identify morpheme breaks. The method proposed in Harris (1955) appealed to what today we would call an oracle for information about the language under scrutiny, but in his 1967 article, Harris implemented a similar procedure on a computer and a fixed corpus, restricting his problem to that of finding morpheme boundaries within words. Harris's method is quite good as a heuristic for finding a good set of candidate morphemes, comparable in quality to the mutual information-based heuristic that I have used, and which I describe below. It has the same problem that good heuristics frequently have: it has many inaccuracies, and it does not lend itself to a next step, a qualitatively more reliable approximation of the correct solution. 7 Hafer and Weiss (1974) explore in detail various ways of clarifying and improving on Harris's algorithm while remaining faithful to the original intent. A brief summary does not do justice to their fascinating discussion, but for our purposes, their results confirm the character of the Harrisian test as heuristic: with Harris's proposal, a quantitative measure is proposed (and Hafer and Weiss develop a range of 15 different measures, all of them rooted in Harris's proposal), and best results for morphological analysis are obtained in some cases by seeking a local maximum of prefix conditional entropy, in others by seeking a value above a threshold, and in yet others, good results are obtained only when this measure is paired with a similar measure constructed in mirror-image fashion from the end of the word--and then some arbitrary thresholds are selected which yield the best results. While no single method emerges as the best, one of the best yields precision of 0.91 and recall of 0.61 on a corpus of approximately 6,200 word types. (Precision here indicates proportion of predicted morpheme breaks that are correct, and recall denotes the proportion of correct breaks that are predicted.) The second approach that can be found in the literature is based on the hypothesis that local information in the string of letters (respectively, phonemes) is sufficient to identify morpheme boundaries. This hypothesis would be clearly correct if all morpheme boundaries were between pairs of letters 11-12 that never occur in that sequence</Paragraph>
  </Section>
  <Section position="4" start_page="156" end_page="163" type="metho">
    <SectionTitle>
7 But Harris's method does lend itself to a generalization to more difficult cases of morphological analysis going beyond the scope of the present paper. In work in progress, we have used minimization
</SectionTitle>
    <Paragraph position="0"> of mutual information between successive candidate morphemes as part of a heuristic for preferring a morphological analysis in languages with a large number of suffixes per word.</Paragraph>
    <Paragraph position="1">  Computational Linguistics Volume 27, Number 2 morpheme internally, and the hypothesis would be invalidated if conditional probabilities of a letter given the previous letter were independent of the presence of an intervening boundary. The question is where real languages distribute themselves along the continuum that stretches between these two extremes.</Paragraph>
    <Paragraph position="2"> A series of publications has explored this question, including Janssen (1992), Klenk (1992), and Flenner (1994, 1995). Any brief description that overlooks the differences among these publications is certain to do less than full justice to all of them. The procedure described in Janssen (1992) and Flenner (1994, 1995) begins with a training corpus with morpheme boundaries inserted by a human, and hence the algorithm is not in the domain of unsupervised learning. Each bigram (and the algorithm has been extended in the natural way to treating trigrams as well) is associated with a triple (whose sum must be less than or equal to 1.0) indicating the frequency in the training corpus of a morpheme boundary occurring to the left of, between, or to the right of that bigram. In a test word, each space between letters (respectively, phonemes) is assigned a score that is the sum of the relevant values derived from the training session: in the word string, for example, the score for the potential cut between str and ing is the sum of three values: the probability of a morpheme boundary after tr (given tr), the probability of a morpheme boundary between r and i (given ri), and the probability of a morpheme boundary before in (given in).</Paragraph>
    <Paragraph position="3"> That these numbers should give some indication of the presence of a morpheme boundary is certain, for they are the sums of numbers that were explicitly assigned on the basis of overtly marked morpheme boundaries. But it remains unclear how one should proceed further with the sum. As Hafer and Weiss discover with Harris's measure, it is unclear whether local peaks of this measure should predict morpheme boundaries, or whether a threshold should be set, above which a morpheme boundary is predicted. Flenner (1995, 64-65) and proponents of this approach have felt some freedom on making this choice in an ad hoc fashion. Janssen (1992, 81-82) observes that the French word linguistique displays three peaks, predicting the analysis linguist-ique, employing a trigram model. The reason for the strong, but spurious, peak after lin is that lin occurs with high frequency word finally, just as gui appears with high frequency word initially. One could respond to this observation in several ways: word-final frequency should not contribute to word-internal, morpheme-final status; or perhaps frequencies of this sort should not be added. Indeed, it is not clear at all why these numbers should be added; they do not, for example, represent probabilities that can be added. Janssen notes that the other two trigrams that enter into the picture (ing and ngu) had a zero frequency of morpheme break in the desired spot, and proposes that the presence of any zeros in the sum forces the sum to be 0, raising again the question of what kind of quantity is being modeled; there is no scholarly tradition according to which the presence of zero in a sum should lead to a total of 0.</Paragraph>
    <Paragraph position="4"> I do not have room to discuss the range of greedy affix-parsing algorithms these authors explore, but that aspect of their work has less bearing on the comparison with the present paper, whose focus is on data-driven learning. The major question to carry away from this approach is this: can the information that is expressed in the division of a set of words into morphemes be compressed into local information (bigrams, trigrams)? The answer, I believe, is in general negative. Morphology operates at a higher level, so to speak, and has only weak statistical links to local sequencing of phonemes or letters, s 8 On this score, language will surely vary to some degree. English, for example, tends to employ rules of morphophonology to modify the surface form of morphologically complex words so as to better match the phonological pattern of unanalyzed words. This is discussed at length in Goldsmith (1990, Chap. 5).  Goldsmith Unsupervised Learning of the Morphology of a Natural Language The third approach focuses on the discovery of patterns explicating the overt shapes of related forms in a paradigm. Dzeroski and Erjavec (1997) report on work that they have done on Slovene, a South Slavic language with a complex morphology, in the context of a similar project. Their goal essentially was to see if an inductive logic program could infer the principles of Slovene morphology to the point where it could correctly predict the nominative singular form of a word if it were given an oblique (nonnominative) form. Their project apparently shares with the present one the requirement that the automatic learning algorithm be responsible for the decision as to which letters constitute the stem and which are part of the suffix(es), though the details offered by Dzeroski and Erjavec are sketchy as to how this is accomplished.</Paragraph>
    <Paragraph position="5"> In any event, they present their learning algorithm with a labeled pair of words--a base form and an inflected form. It is not clear from their description whether the base form that they supply is a surface form from a particular point in the inflectional paradigm (the nominative singular), or a more articulated underlying representation in a generative linguistic sense; the former appears to be their policy.</Paragraph>
    <Paragraph position="6"> Dzeroski and Erjavec's goal is the development of rules couched in traditional linguistic terms; the categories of analysis are decided upon ahead of time by the programmer (or, more specifically, by the tagger of the corpus), and each individual word is identified with regard to what morphosyntactic features it bears. The form bolecina is marked, for example, as a feminine noun singular genitive. In sum, their project thus gives the system a good deal more information than the present project does. 9 Two recent papers, Jacquemin (1997) and Gaussier (1999), deserve consideration here. 1deg Gaussier (1999) approaches a very similar task to that which we consider, and takes some similar steps. His goal is to acquire derivational rules from an inflectional lexicon, thus insuring that his algorithm has access to the lexical category of the words it deals with (unlike the present study, which is allowed no such access). Using the terminology of the present paper, Gaussier considers candidate suffixes if they appear with at least two stems of length 5. His first task is (in our terms) to infer paradigms from signatures (see Section 9), which is to say, to find appropriate clusters of signatures. One example cited is depart, departure, departer. He used a hierarchical agglomerative clustering method, which begins with all signatures forming distinct clusters, and successively collapses the two most similar clusters, where similarity between stems is defined as the number of suffixes that two stems share, and similarity between clusters is defined as the similarity between the two least similar stems in the respective clusters. He reports a success rate of 77%, but it is not clear how to evaluate this figure. 11 The task that Gaussier addresses is defined from the start to be that of derivational morphology, and because of that, his analysis does not need to address the problem of inflectional morphology, but it is there (front and center, so to speak) that the difficult clustering problem arises, which is how to ensure that the signatures NULL.s.'s (for nouns in English) and the signature NULL.ed.s (or NULL.ed.ing.s) are not assigned to single clusters. 12 That is, in English both nouns and verbs freely occur with the suffixes 9 Baroni (2000) reported success using an MDL-based model in the task of discovering English prefixes. I have not had access to further details of the operation of the system.</Paragraph>
    <Paragraph position="7"> 10 I am grateful to a referee for drawing my attention to these papers. 11 The analysis of a word w in cluster C counts as a success if most of the words that in fact are related to w also appear in the cluster C, and if the cluster &amp;quot;comprised in majority words of the derivational family of w.&amp;quot; I am not certain how to interpret this latter condition; it means perhaps that more than half of the words in C contain suffixes shared by forms related to w. 12 In traditional terms, inflectional morphology is responsible for marking different forms of the same lexical item (lemma), while derivational morphology is responsible for the changes in form between distinct but morphologically related lexical items (lemmas).</Paragraph>
    <Paragraph position="8">  Computational Linguistics Volume 27, Number 2 NULL and -s, and while -ed and -~s disambiguate the two cases, it is very difficult to find a statistical and morphological basis for this knowledge, lB Jacquemin (1997) explores an additional source of evidence regarding clustering of hypothesized segmentation of words into stems and suffixes; he notes that the hypothesis that there is a common stem gen in gene and genetic, and a common stem express in expression and expressed, is supported by the existence of small windows in corpora containing the word pair genetic...expression and the word pair gene.., expressed (as indicated, the words need not be adjacent in order to provide evidence for the relationship). As this example suggests, Jacquemin's work is situated within the context of a desire for superior information retrieval.</Paragraph>
    <Paragraph position="9"> In terms of the present study, Jacquemin's algorithm consists of (1) finding signatures with the longest possible stems and (2) establishing pairs of stems that occur together in two or more windows of length 5 or less. He tests his results on 100 random pairs discovered in this fashion, placing upper bounds on the length of the suffix permitted between one and five letters, and independently varying the length of the window in question. He does not vary the minimum size of the stem, a consideration that turns out to be quite important in Germanic languages, though less so in Romance languages. He finds that precision varies from 97% when suffixes are limited to a length of one letter, to 64% when suffixes may be five letters long, with both figures assuming an adjacency window of two words; precision falls to 15% when a window of four words is permitted.</Paragraph>
    <Paragraph position="10"> Jacquemin also employs the term signature in a sense not entirely dissimilar to that employed in the present paper, referring to the structured set of four suffixes that appear in the two windows (in the case above, the suffixes are -ion, -ed; NULL, -tic). He notes that incorrect signatures arise in a large number of cases (e.g., good: optical control ~ optimal control; adoptive transfer ~ adoptively tranfer, paralleled by bad: ear disease ~ early disease), and suggests a quality function along the following lines: Stems are linked in pairs (adopt-transfer, ear-disease); compute then the average length of the shorter stem in each pair (that is, create a set of the shorter member of each pair, and find the average length of that set). The quality function is defined as that average divided by the length of the largest suffix in the signature; reject any signature class for which that ratio is less than 1.0. This formula, and the threshold, is purely empirical, in the sense that there is no larger perspective that bears on determining the appropriateness of the formula, or the values of the parameters.</Paragraph>
    <Paragraph position="11"> The strength of this approach, clearly, is its use of information that co-occurrence in a small window provides regarding semantic relatedness. This allows a more aggressive stance toward suffix identification (e.g., alpha interferon ~ alpha2 interferon). There can be little question that the type of corpus studied (a large technical medical corpus, and a list of terms--partially multiword terms) lends itself particularly to this style of inference, and that similar patterns would be far rarer in unrestricted text such as Tom Sawyer or the Brown corpus. 14 13 Gaussier also offers a discussion of inference of regular morphophonemics, which we do not treat in the present paper, and a discussion in a final section of additional analysis, though without test results. Gaussier aptly calls our attention to the relevance of minimum edit distance relating two potential allomorphs, and he proposes a probabilistic model based on patterns established between allomorphs. In work not discussed in this paper, I have explored the integration of minimum edit distance to an MDL account of allomorphy as well, and will discuss this material in future work.</Paragraph>
    <Paragraph position="12"> 14 In a final section, Jacquemin considers how his notion of signatures can be extended to identify sets of related suffixes (e.g., onic/atic/ic--his example). He uses a greedy clustering algorithm to successively add nonclustered signatures to clusters, in a fashion similar to that of Gaussier (who Jacquemin thanks for discussion, and of course Jacquemin's paper preceded Gaussier's paper by two years), using a  Naive description length.</Paragraph>
    <Paragraph position="13"> The fourth approach to morphology analysis is top-down, and seeks a globally optimal analysis of the corpus. This general approach is based on the insight that the number of letters in a list of words is greater than the number of letters in a list of the stems and affixes that are present in the original list. This is illustrated in Figure 1. This simple observation lends hope to the notion that we might be able to specify a relatively simple figure of merit independently of how we attempt to find analyses of particular data. This view, appropriately elaborated, is part of the minimum description length approach that we will discuss in detail in this paper.</Paragraph>
    <Paragraph position="14"> Kazakov (1997) presents an analysis in this fourth approach, using a straightforward measurement of the success of a morphological analysis that we have mentioned, counting the number of letters in the inventory of stems and suffixes that have been hypothesized; the improvement in this count over the number of letters in the original word list is a measure of the fitness of the analysis. 15 He used a list of 120 French words in one experiment, and 39 forms of the same verb in another experiment, and employed what he terms a genetic algorithm to find the best cut in each word. He associated each of the 120 words (respectively, 39) with an integer (between 1 and the length of the word minus 1) indicating where the morphological split was to be, and measured the fitness of that grammar in terms of its decrease in number of total letters. He does not describe the fitness function used, but seems to suggest that the metric more complex than the familiar minimum edit distance, but no results are offered in support of the choice of the additional complexity.</Paragraph>
    <Paragraph position="15"> 15 I am grateful to Scott Meredith for drawing my attention to this paper.</Paragraph>
    <Paragraph position="16">  Computational Linguistics Volume 27, Number 2 single top-performing grammar of each generation is preserved, all others are eliminated, and the top-performing grammar is then subjected to mutation. That is, in a case-by-case fashion, the split between stems and suffixes is modified (in some cases by a shift of a single letter, in others by an unconstrained shift to another location within the word) to form a new grammar. In one experiment described by Kazakov, the population was set to 800, and 2,000 generations were modeled. On a Pentium 90 and a vocabulary of 120 items, the computation took over eight hours.</Paragraph>
    <Paragraph position="17"> Work by Michael Brent (1993) and Carl de Marcken (1995) has explored analyses of the fourth type as well. Researchers have been aware of the utility of the information-theoretic notion of compression from the earliest days of information theory, and there have been efforts to discover useful, frequent chunks of letters in text, such as Radhakrishnan (1978), but to my knowledge, Brent's and de Marcken's works were the first to explicitly propose the guiding of linguistic hypotheses by such notions. Brent's work addresses the question of determining the correct morphological analysis of a corpus of English words, given their syntactic category, utilizing the notion of minimal encoding, while de Marcken's addresses the problem of determining the &amp;quot;breaking&amp;quot; of an unbroken stream of letters or phonemes into chunks that correspond as well as possible to our conception of words, implementing a well-articulated algorithm couched in a minimum description length framework, and exploring its effects on several large corpora.</Paragraph>
    <Paragraph position="18"> Brent (1993) aims at finding the appropriate set of suffixes from a corpus, rather than the more comprehensive goal of finding the correct analysis for each word, both stem and suffix, and I think it would not be unfair to describe it as a test-of-concept trial on a corpus ranging in size from 500 to 8,000 words; while this is not a small number of words, our studies below focus on corpora with on the order of 30,000 distinct words. Brent indicates that he places other limitations as well on the hypothesis space, such as permitting no suffix which ends in a sequence that is also a suffix (i.e., if s is a suffix, then less and ness are not suffixes, and if y is a suffix, ity is not).</Paragraph>
    <Paragraph position="19"> Brent's observation is very much in line with the spirit of the present analysis: &amp;quot;The input lexicons contained thousands of non-morphemic endings and mere dozens of morphemic suffixes, but the output contained primarily morphemic suffixes in all cases but one. Thus, the effects of non-morphemic regularities are minimal&amp;quot; (p. 35). Brent's corpora were quite different from those used in the experiments reported below; his were based on choosing the n most common words in a Wall Street Journal corpus, while the present study has used large and heterogeneous sources for corpora, which makes for a considerably more difficult task. In addition, Brent scored his algorithm solely on how well it succeeded in identifying suffixes (or combinations of suffixes), rather than on how well it simultaneously analysed stem and suffix for each word, the goal of the present study. ~6 Brent makes clear the relevance and importance of information-theoretic notions, but does not provide a synthetic and overall measure of the length of the morphological grammar.</Paragraph>
    <Paragraph position="20"> 16 Brent's description of his algorithm is not detailed enough to satisfy the curiosity of someone like the present writer, who has encountered problems that Brent's approach would seem certain to encounter equally. As we shall see below, the central practical problem to grapple with is the fact that when considering suffixes (or candidate suffixes) consisting of only a single letter (let us say, s, for example), it is extremely difficult to get a good estimate of how many of the potential occurrences (of word-final s) are suffixal s and how many are not. As we shall suggest towards the end of this paper, the only accurate way to make an estimate is on the basis of a multinomial estimate once larger suffix signatures have been established. Without this, it is difficult not to overestimate the frequency of single-letter suffixes, a result that may often, in my experience, deflect the learning algorithm from discovering a correct two-letter suffix (e.g., the suffix -al in French).</Paragraph>
    <Paragraph position="21">  Goldsmith Unsupervised Learning of the Morphology of a Natural Language De Marcken (1995) addresses a similar but distinct task, that of determining the correct breaking of a continuous stream of segments into distinct words. This problem has been addressed in the context of Asian (Chinese-Japanese-Korean) languages, where standard orthography does not include white space between words, and it has been discussed in the context of language acquisition as well. De Marcken describes an unsupervised learning algorithm for the development of a lexicon using a minimum description length framework. He applies the algorithm to a written corpus of Chinese, as well as to written and spoken corpora of English (the English text has had the spaces between words removed), and his effort inspired the work reported here.</Paragraph>
    <Paragraph position="22"> De Marcken's algorithm begins by taking all individual characters to be the baseline lexicon, and it successively adds items to the lexicon if the items will be useful in creating a better compression of the corpus in question, or rather, when the improvement in compression yielded by the addition of a new item to the codebook is greater than the length (or &amp;quot;cost&amp;quot;) associated with the new item in the codebook. In general, a lexical item of frequency F can be associated with a compressed length of - log F, and de Marcken's algorithm computes the compressed length of the Viterbi-best parse of the corpus, where the compressed length of the whole is the sum of the compressed lengths of the individual words (or hypothesized chunks, we might say) plus that of the lexicon. In general, the addition of chunks to the lexicon (beginning with such high-frequency items as th) will improve the compression of the corpus as a whole, and de Marcken shows that successive iterations add successively larger pieces to the lexicon. De Marcken's procedure builds in a bottom-up fashion, looking for larger and larger chunks that are worth (in an MDL sense) assigning the status of dictionary entries. Thus, if we look at unbroken orthographic texts in English, the two-letter combination th will become the first candidate chosen for lexical status; later, is will achieve that status too, and soon this will as well. The entry this will not, in effect, point to its four letters directly, but will rather point to the chunks th and is, which still retain their status in the lexicon (for their robust integrity is supported by their appearance throughout the lexicon). The creation of larger constituents will occasionally lead to the elimination of smaller chunks, but only when the smaller chunk appears almost always in a single larger unit.</Paragraph>
    <Paragraph position="23"> An example of an analysis provided by de Marcken's algorithm is given in (1), taken from de Marcken (1995), in which I have indicated the smallest-level constituent by placing letters immediately next to one another, and then higher structure with various pair brackets (parentheses, etc.) for orthographic convenience; there is no theoretical significance to the difference between &amp;quot;( )&amp;quot; and &amp;quot;0&amp;quot;, etc. De Marcken's analysis succeeds quite well at identifying words, but does not make any significant effort at identifying morphemes as such.</Paragraph>
    <Paragraph position="25"> Applying de Marcken's algorithm to a &amp;quot;broken&amp;quot; corpus of a language in which word boundaries are indicated (for example, English) provides interesting results, but none that provide anything even approaching a linguistic analysis, such as identification of stems and affixes. The broken character of the corpus serves essentially as an upper bound for the chunks that are postulated, while the letters represent the lower bound.</Paragraph>
    <Paragraph position="26"> De Marcken's MDL-based figure of merit for the analysis of a substring of the corpus is the sum of the inverse log frequencies of the components of the string in question; the best analysis is that which minimizes that number (which is, again, the optimal compressed length of that substring), plus the compressed length of each  Computational Linguistics Volume 27, Number 2 of the lexical items that have been hypothesized to form the lexicon of the corpus. It would certainly be natural to try using this figure of merit on words in English, along with the constraint that all words should be divided into exactly two pieces. Applied straightforwardly, however, this gives uninteresting results: words will always be divided into two pieces, where one of the pieces is the first or the last letter of the word, since individual letters are so much more common than morphemes. 17 (I will refer to this effect as peripheral cutting below.) In addition--and this is less obvious--the hierarchical character of de Marcken's model of chunking leaves no place for a qualitative difference between high-frequency &amp;quot;chunks,&amp;quot; on the one hand, and true morphemes, on the other: str is a high-frequency chunk in English (as schl is in German), but it is not at all a morpheme. The possessive marker ~s, on the other hand, is of relatively low frequency in English, but is clearly a morpheme.</Paragraph>
    <Paragraph position="27"> MDL is nonetheless the key to understanding this problem. In the next section, I will present a brief description of the algorithm used to bootstrap the problem, one which avoids the trap mentioned briefly in note 21. This provides us with a set of candidate splittings, and the notion of the signature of the stem becomes the working tool for determining which of these splits is linguistically significant. MDL is a framework for evaluating proposed analyses, but it does not provide a set of heuristics that are nonetheless essential for obtaining candidate analyses, which will be the subject of the next two sections.</Paragraph>
  </Section>
  <Section position="5" start_page="163" end_page="169" type="metho">
    <SectionTitle>
3. Minimum Description Length Model
</SectionTitle>
    <Paragraph position="0"> The central idea of minimum description length analysis (Rissanen 1989) is composed of four parts: first, a model of a set of data assigns a probability distribution to the sample space from which the data is assumed to be drawn; second, the model can then be used to assign a compressed length to the data, using familiar information-theoretic notions; third, the model can itself be assigned a length; and fourth, the optimal analysis of the data is the one for which the sum of the length of the compressed data and the length of the model is the smallest. That is, we seek a minimally compact specification of both the model and the data, simultaneously. Accordingly, we use the conceptual vocabulary of information theory as it becomes relevant to computing the length, in bits, of various aspects of the morphology and the data representation.</Paragraph>
    <Section position="1" start_page="163" end_page="167" type="sub_section">
      <SectionTitle>
3.1 A First Model
</SectionTitle>
      <Paragraph position="0"> Let us suppose that we know (part of) the correct analysis of a set of words, and we wish to create a model using that knowledge. In particular, we know which words have no morphological analysis, and for all the words that do have a morphological analysis, we know the final suffix of the word. (We return in the next section to how we might arrive at that knowledge.) An MDL model can most easily be conceptualized if we encode all such knowledge by means of lists; see Figure 2. In the present case, we have three lists: a list of stems, of suffixes, and of signatures. We construct a list of the stems of the corpus defined as the set of the unanalyzed words, plus the material that precedes the final suffix of each morphologically analyzed word. We also construct a list of suffixes that occur with at least one stem. Finally, each stem is empirically associated with a set of suffixes (those with which it appears in the corpus); we call this set the stem's signature, and we construct a third list, consisting of the signatures that appear in this corpus. This third list, however, contains no letters (as the other 17 See note 21 below.</Paragraph>
      <Paragraph position="1">  Goldsmith Unsupervised Learning of the Morphology of a Natural Language A. Affixes: 6 1. NULL 2. ed 3. ing 4. s 5. e i 6. es B. Stems: 9 !1. cat 2. dog 3. hat 4. John 5. jump 6. laugh 7. sav 8. the 9. walk C. Signatures: 4 Signature 1: / treat</Paragraph>
      <Paragraph position="3"> A sample morphology. This morphology covers the words: cat, cats, dog, dogs, hat, hats, save, saves, saving, savings, jump, jumped, jumping, jumps, laugh, laughed, laughing, laughs, walk, walked, walking, walks, the, John.</Paragraph>
      <Paragraph position="4"> lists do), but rather pointers to stems and suffixes. We do this, in one sense, because our goal is to construct the smallest morphology, and in general a pointer requires less information than an explicit set of letters. But in a deeper sense, it is the signatures whose compactness provides the explicit measurement of the conciseness of the entire analysis. Note that by construction, each stem is associated with exactly one signature.</Paragraph>
      <Paragraph position="5">  Computational Linguistics Volume 27, Number 2 Since stem, suffix, and signature all begin with s, we opt for using t to represent a stem, f to represent a suffix, and cr to represent a signature, while the uppercase T, F, E represent the sets of stems, suffixes, and signatures, respectively. The number of members of such a set will be represented (T) , (F/, etc., while the number of occurrences of a stem, suffix, etc., will be represented as \[t\], \[f\], etc. The set of all words in the corpus will be represented as W; hence the length of the corpus is \[W\], and the size of the vocabulary is (W).</Paragraph>
      <Paragraph position="6"> Note the structure of the signatures in Figure 2. Logically a signature consists of two lists of pointers, one a list of pointers to stems, the other a list of pointers to suffixes. To specify a list of length N, we must specify at the beginning of the signature that N items will follow, and this requires just slightly more than log 2 N bits to do (see Rissanen \[1989, 33-34\] for detailed discussion); I will use the notation A(N) to indicate this function.</Paragraph>
      <Paragraph position="7"> A pointer to a stem t, in turn, is of length -log prob (t), a basic principle of information theory (Li and Vit8nyi 1997). Hence the length of a signature is the sum of the (inverse) log probabilities of its stems, plus that of its suffixes, plus the number of bits it takes to specify the number of its stems and suffixes, using the A function. We will return in a moment to how we determine the probabilities of the stems and suffixes; looking ahead, it will be the empirical frequency.</Paragraph>
      <Paragraph position="8"> Let us consider the length of stem list T. As we have already observed, its length is ),((T))--this is the length of the information specifying how long the list is--plus the length of each stem specification. In most of our work, we make the assumption that the length of a stem is the number of letters in it, weighted by the factor log 26 converting to binary bits, in a language with 26 lettersJ 8 The same reasoning holds for the suffix list F: its length is X((F)) plus the length of each suffix, which we may take to be the total number of letters in the suffix times log 26.</Paragraph>
      <Paragraph position="9"> We return to the question of how long the pointer (found inside a signature) to a stem or suffix is. The probability of a stem is its (empirical) frequency, i.e., the total number of words in the corpus corresponding to the words whose analysis includes the stem in question; the probability of a suffix is defined in parallel fashion. Using W to indicate all the words of the corpus, we may say that the length of a pointer to a stem t is of length a pointer to suffix f is of length log \[w\] \[t\] ' log \[% K' 18 This is a reasonable, and convenient, assumption, but it may not be precise enough for all work. A more refined measure would take the length of a letter to be -1 times the binary log of its frequency. A still more refined measure would base the probability of a letter on bigram context; this matters for English, where stem final t is very common. In addition, there is information in the linear order in which the letters are stored, roughly equal to</Paragraph>
      <Paragraph position="11"> for a string of length n (compare the information that distinguishes the lexical representation of anagrams). This is an additional consideration in an MDL analysis of morphology pressing in favor of breaking words into morphemes when possible.</Paragraph>
      <Paragraph position="12">  Goldsmith Unsupervised Learning of the Morphology of a Natural Language and a pointer to a signature cr is of length \[w\] log -\[cr\] &amp;quot; We have now settled the question of how to determine the length of our initial model; we next must determine the probability that the model assigns to each word in the corpus, and armed with that knowledge, we will be able to compute the compressed length of the corpus.</Paragraph>
      <Paragraph position="13"> The morphology assigns a probability to each word w as the product of the probability of w's signature times w's stem, given its signature, and w's suffix, given its signature: prob (w = t +f) = prob (c 0 prob (t I or) prob (f \] or), where cr is the signature associated with t: cr = sig(t). Thus while stems and suffixes, which are defined relative to a particular morphological model, are assigned their empirical frequency as their probability, words are assigned a probability based on the model, one which will always depart from the empirical frequency. The compression to the corpus is thus worse than would be a compression based on word frequency alone, 19 or to put it another way, the morphological analysis in which all words are unanalyzed is the analysis in which each word is trivially assigned its own empirical frequency (since the word equals the stem). But this decrease in compression that comes with morphological analysis is the price willingly paid for not having to enter every distinct word in the stem list of the morphology.</Paragraph>
      <Paragraph position="14"> Summarizing, the compressed length of the corpus is</Paragraph>
      <Paragraph position="16"> where we have summed over the words in the corpus, and or(w) is the signature to which word w is assigned. The compressed length of the model is the length of the stem list, the suffix list, and the signature list. The length in bits of the stem list is</Paragraph>
      <Paragraph position="18"> and the length of the suffix list is A((r)) + L, po(f), f ff Suffixes where LtvpoO is the measurement of the length of a string of letters in bits, which we take to be log 2 26 times the number of letters (but recall note 18). The length of the signature list is A((~,)) + Z L(C/), c~ ff Sign atures where L(~) is the length of signature or. If the set of stems linked to signature a is T(~r) and the set of suffixes linked to signature a is F(a), then + + S-&amp;quot; log \[w\] + fcr(C/)Z log \[words(f) N words(cr)\]&amp;quot; 19 Due to the fact that the cross-entropy is always greater than or equal to the entropy.  Computational Linguistics Volume 27, Number 2 (The denominator in the last term consists of the token count of words in a particular signature with the given suffix f, and we will refer to this below more simply as in cr\].) It is no doubt easy to get lost in the formalism, so it may be helpful to point out what the contribution of the additional structure accomplishes. We observed above that the MDL analysis is an elaboration of the insight that the best morphological analysis of a corpus is obtained by counting the total number of letters in the list of stems and suffixes according to various analyses, and choosing the analysis for which this sum is the least (cf. Figure 2). This simple insight fails rapidly when we observe in a language such as English that there are a large number of verb stems that end in t. Verbs appear with a null suffix (that is, in bare stem form), with the suffixes -s, -ed, and -ing. But once we have 11 stems ending in t, the naive letter-counting approach will judge it a good idea to create a new set of suffixes: -t, -ted, -ts, and -ting, because those 10 letters will allow us to remove 11 or more letters from the list of stems. It is the creation of the lists, notably the signature list, and an information cost which increases as probability decreases, that overcomes that problem. Creating a new signature may save some information associated with the stem list in the morphology, but since the length of pointers to a signature cr is - log freq (0), the length of the pointers to the signatures for all of the words in the corpus associated with the old signature (-O, -ed, -s, -ing) or the new signature (-ts, -ted, -ting, -ts) will be longer than the length of the pointers to a signature whose token count is the sum of the token count of the two combined, i.e., xldegg (~-~)+yldegg (~)~ (x+y)ldegg (x-~y) *</Paragraph>
    </Section>
    <Section position="2" start_page="167" end_page="169" type="sub_section">
      <SectionTitle>
3.2 Recursive Morphological Structure
</SectionTitle>
      <Paragraph position="0"> The model presented above is too simple in that it underestimates the gain achieved by morphological analysis in case the word that is analyzed is also a stem of a larger word. For example, if a corpus contains the words work and working, then morphological analysis will allow us to dispense with the form working; it is modeled by the stem work and the suffixes -O and -ing. If the corpus also includes workings, the analysis working-s additionally lowers the cost of the stem working. Clearly we would like stems to be in turn analyzable as stems + suffixes. Implementing this suggestion involves the following modifications: (i) Each pointer to a stem (and these are found both in the compressed representation of each individual word in the corpus, and inside the individual signatures of the morphological model) must contain a flag indicating whether what follows is a pointer to a simple member of the stem list (as in the original model), or a triple pointer to a signature, stem, and suffix. In the latter case, which would be the case for the word \[work-ing\]-s, the pointer to the stem consists of a triple identical to the signature for the word work-ing. (ii) The number of words in the corpus has now changed, in that the word \[work-ing\]-s now contains two words, not one. We will need to distinguish between counts of a word w where w is a freestanding word, and counts where it is part of a larger word; we shall refer to the latter class as secondary counts. In order to simplify computation and exposition, we have adopted the convention that the total number of words remains fixed, even when nested structure is posited by the morphology, thus forcing the convention that counts are distributed in a nonintegral fashion over the two or more nested word structures found in complex words. We consider the more complex case in the appendix. 2deg 20 In addition, the number of words in a corpus will change if the analysis determines that all occurrences of (let us say) -ings are to be reanalyzed as complex words, and the stem in question  Goldsmith Unsupervised Learning of the Morphology of a Natural Language We may distinguish between those words, like work or working, whose immediate analysis involves a stem appearing in the stem list (we may call these WSIMPLE ) and those whose analysis, like workings, involves recursive structure (we may call these WCOMPLEX ). AS we have noted, every stern entry in a signature begins with a flag indicating which kind of stem it is, and this flag will be of length \[wl log \[WsIMPLE \] for simple stems, and of length \[w\] log \[WcoMPrZX\] for complex stems. We also keep track separately of the total number of words in the corpus (token count) that are morphologically analyzed, and refer to this set as WA; this consists of all words except those that are analyzed as having no suffix (see item  (ii) in (2), below).</Paragraph>
      <Paragraph position="1"> (2) Compressed length of morphology  (b) Size of the count of the number of stems plus size of the count of the number of suffixes: ;~((stems(a))) + ~((suffixes(a))) (c) A pointer to each stem, consisting of a simple/complex flag, and a pointer to either a simple or complex stem: (i) Case of simple stem: flag of length \[w\] log \[WsIMPLE\] (perhaps work-ing) did not appear independently as a freestanding word in the corpus; we will refer to these inferred words as being &amp;quot;virtual&amp;quot; words with virtual counts.  Computational Linguistics Volume 27, Number 2 plus a pointer to a stem of length log \[w\]. \[t\] '  log \[stem(t)~-~ + log \[suffix(t) in cr\]&amp;quot; (d) a pointer to each suffix, of total length v'z_. log ~ in ~\] f c suyfixe~ ( ~ ) (3) Compressed length of corpus \[w\] \[~(w)\] \[~(w)\] \] \[w\] log ~ + log + log \[stem(w)\] \[suffix(w)in a(w)\]\] wEW  MDL thus provides a figure of merit that we wish to minimize, and we will seek heuristics that modify the morphological analysis in such a fashion as to decrease this figure of merit in a large proportion of cases. In any given case, we will accept a modification to our analysis just in case the description length decreases, and we will suggest that this strategy coincides with traditional linguistic judgment in all clear cases.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="169" end_page="172" type="metho">
    <SectionTitle>
4. Heuristics for Word Segmentation
</SectionTitle>
    <Paragraph position="0"> The MDL model designed in the preceding section will be of use only if we can provide a practical means of creating one or more plausible morphologies for a given corpus.</Paragraph>
    <Paragraph position="1"> That is, we need bootstrapping heuristics that enable us to go from a corpus to such a morphology. As we shall see, it is not in fact difficult to come up with a plausible initial morphology, but I would like to consider first an approach which, though it might seem like the most natural one to try, fails, and for an interesting reason.</Paragraph>
    <Paragraph position="2"> The problem we wish to solve can be thought of as one suited to an expectation-maximization (EM) approach (Dempster, Laird, and Rubin 1977). Along such a line, each word w of length N would be initially conceived of as being analyzed in N different ways, cutting the word into stem + suffix after i letters, 1 K i &lt; N, with each of these N analyses being assigned probability mass of  Goldsmith Unsupervised Learning of the Morphology of a Natural Language That probability mass is then summed over the resulting set of stems and suffixes, and on successive iterations, each of the N cuts into stem + suffix is weighted by its probability; that is, if the ith cut of word w, of length I, cuts it into a stem t of length i and suffix of length 1 - i, then the probability of that cut is defined as</Paragraph>
    <Paragraph position="4"> where ZOj,k refers to the substring of w from the jth to the kth letter. Probability mass for the stem and the suffix in each such cut is then augmented by an amount equal to the frequency of word w times the probability of the cut. After several iterations (approximately four), estimated probabilities stabilize, and each word is analyzed on the basis of the cut with the largest probability.</Paragraph>
    <Paragraph position="5"> This initially plausible approach fails because it always prefers an analysis in which either the stem or (more often) the suffix consists of a single letter. More importantly, the probability that a sequence of one or more word-final letters is a suffix is very poorly modeled by the sequence's frequency. 21 To put the point another way, even the initial heuristic analyzing one particular word must take into account all of the other analyses in a more articulated way than this particular approach does.</Paragraph>
    <Paragraph position="6"> I will turn now to two alternative heuristics that succeed in producing an initial morphological analysis (and refer to a third in a note). It seems likely that one could construct a number of additional heuristics of this sort. The point to emphasize is that the primary responsibility of the overall morphology is not that of the initial heuristic, but rather of the MDL model described in the previous section. The heuristics described in this section create an initial morphology that can serve as a starting point in a search for the shortest overall description of the morphology. We deal with that process in Section 5.</Paragraph>
    <Section position="1" start_page="170" end_page="171" type="sub_section">
      <SectionTitle>
4.1 First Heuristic
</SectionTitle>
      <Paragraph position="0"> A heuristic that I will call the take-all-splits heuristic, and which considers all cuts of a word of length 1 into stem+suffix Wl,i -t- Wi+l,l, where 1 G i &lt; 1, much like the EM approach mentioned immediately above, works much more effectively if the probability is assigned on the basis of a Boltzmann distribution; see (4) below. The function H(.) in (4) assigns a value to a split of word w of length h w U + wi+l,l. H does not assign a proper distribution; we use it to assign a probability to the cut of w into w~,i + wi+u as in (5). Clearly the effect of this model is to encourage splits containing relatively long suffixes and stems.</Paragraph>
      <Paragraph position="2"> 21 It is instructive to think about why this should be so. Consider a word such as diplomacy. If we cut the word into the pieces diplomac + y, its probability is freq (diplomac)* freq (y), and constrast that value with the corresponding values of two other analyses: freq (diploma)* freq (cy), and freq (diplom)* freq (acy). Now, the ratio of the frequency of words that begin with diploma and those that begin with diplomac is less than 3, while the ratio of the frequency of words that end in y and those that end in cy is much greater. In graphical terms, we might note that tries (the data structure) based on forward spelling have by far the greatest branching structure early in the word, while tries based on backward spelling have the greatest branching structure close to the root node, which is to say at the end of the word.</Paragraph>
      <Paragraph position="3">  Computational Linguistics Volume 27, Number 2 where</Paragraph>
      <Paragraph position="5"> For each word, we note what the best parse is, that is, which parse has the highest rating by virtue of the H-function. We iterate until no word changes its optimal parse, which empirically is typically less than five iterations on the entire lexicon. 22 We now have an initial split of all words into stem plus suffix. Even for words like this and stomach we have such an initial split.</Paragraph>
    </Section>
    <Section position="2" start_page="171" end_page="171" type="sub_section">
      <SectionTitle>
4.2 Second Heuristic
</SectionTitle>
      <Paragraph position="0"> The second approach that we have employed provides a much more rapid convergence on the suffixes of a language. Since our goal presently is to identify word-final suffixes, we assume by convention that all words end with an end-of-word symbol (traditionally &amp;quot;#'), and we then tally the counts of all n-grams of length between two and six letters that appear word finally. Thus, for example, the word elephant# contains one occurrence of the word-final bigram t#, one occurrence of the word-final trigram nt#, and so forth; we stop at 6-grams, on the grounds that no grammatical morphemes require more than five letters in the languages we are dealing with. We also require that the n-gram in question be a proper substring of its word.</Paragraph>
      <Paragraph position="1"> We employ as a rough indicator of likelihood that such an n-gram nln2.., nk is a grammatical morpheme the measure: \[nln2...nk\] log \[nln2...nk\] Total count of k-grams \[n1-~2\] -(~-k\]' which we may refer to as the weighted mutual information. We choose the top 100 n-grams on the basis of this measure as our set of candidate suffixes.</Paragraph>
      <Paragraph position="2"> We should bear in mind that this ranking will be guaranteed to give incorrect results as well as correct ones; for example, while ing is very highly ranked in an English corpus, ting and ng will also be highly ranked, the former because so many stems end in t, the latter because all ings end in ng, but of the three, only ing is a morpheme in English.</Paragraph>
      <Paragraph position="3"> We then parse all words into stem plus suffix if such a parse is possible using a suffix from this candidate set. A considerable number of words will have more than one such parse under those conditions, and we utilize the figure of merit described in the preceding section to choose among those potential parses.</Paragraph>
    </Section>
    <Section position="3" start_page="171" end_page="172" type="sub_section">
      <SectionTitle>
4.3 Evaluating the Results of Initial Word Splitting
</SectionTitle>
      <Paragraph position="0"> Regardless of which of the two approaches we have taken, our task now is to decide which splits are worth keeping, which ones need to be dropped, and which ones need to be modified. 23 In addition, if we follow the take-all-splits approach, we have many 22 Experimenting with other functions suggests empirically that the details of our choices for a figure of merit, and the distribution reported in the text, are relatively unimportant. As long as the measurement is capable of ensuring that the cuts are not strongly pushed towards the periphery, the results we get are robust.</Paragraph>
      <Paragraph position="1"> 23 Various versions of Harris's method of morpheme identification can be used as well. Harris's approach has the interesting characteristic (unlike the heuristics discussed in the text) that it is possible to impose restrictions that improve its precision while at the same time worsening its recall to unacceptably low levels. In work in progress, we are exploring the consequences of using such an initial heuristic with significantly higher precision, while depending on MDL considerations to extend the recall of the entire morphology.</Paragraph>
      <Paragraph position="2">  Goldsmith Unsupervised Learning of the Morphology of a Natural Language splits which (from our external vantage point) are splits between prefix and stem: words begim~ng with de (defense, demand, delete, etc.) will at this point all be split after the initial de. So there is work to be done, and for this we return to the central notion of the signature.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="172" end_page="174" type="metho">
    <SectionTitle>
5. Signatures
</SectionTitle>
    <Paragraph position="0"> Each word now has been assigned an optimal split into stem and suffix by the initial heuristic chosen, and we consider henceforth only the best parse for that word, and we retain only those stems and suffixes that were optimal for at least one word. For each stem, we make a list of those suffixes that appear with it, and we call an alphabetized list of such suffixes (separated by an arbitrary symbol, such as period) the stem's signature; we may think of it as a miniparadigm. For example, in one English corpus, the stems despair, pity, appeal, and insult appear with the suffixes ing and ingly. However, they also appear as freestanding words, and so we use the word NULL, to indicate a zero suffix. Thus their signature is NULL.ing.ingly. Similarly, the stems assist and ignor are assigned the signature ance.ant.ed.ing in a certain corpus. Because each stem is associated with exactly one signature, we will also use the term signature to refer to the set of affixes along with the associated set of stems when no ambiguity arises.</Paragraph>
    <Paragraph position="1"> We establish a data structure of all signatures, keeping track for each signature of which stems are associated with that signature. As an initial heuristic, subject to correction below, we discard all signatures that are associated with only one stem (these latter form the overwhelming majority, well over 90%) and all signatures with only one suffix. The remaining signatures we shall call regular signatures, and we will call all of the suffixes that we find in them the regular suffixes. As we shall see, the regular suffixes are not quite the suffixes we would like to establish for the language, but they are a very good approximation, and constitute a good initial analysis. The nonregular signatures produced by the take-all-splits approach are typically of no interest, as examples such as ch.e.erial.erials.rimony.rons.uring and el.ezed.nce.reupon.ther illustrate.</Paragraph>
    <Paragraph position="2"> The reader may identify the single English pseudostem that occurs with each of these signatures.</Paragraph>
    <Paragraph position="3"> The regular signatures are thus those that specify exactly the entire set of suffixes used by at least two stems in the corpus. The presence of a signature rests upon the existence of a structure as in (6), where there are at least two members present in each column, and all combinations indicated in this structure are present in the corpus, and, in addition, each stem is found with no other suffix. (This last condition does not hold for the suffixes; a suffix may well appear in other signatures, and this is the difference between stems and affixes.) 24</Paragraph>
    <Paragraph position="5"> If we have a morphological pattern of five suffixes, let us say, and there is a large set of stems that appear with all five suffixes, then that set will give rise to a regular signature with five suffixal members. This simple pattern would be perturbed by the (for our purpose) extraneous fact that a stem appearing with these suffixes 24 Langer 1991 discusses some of the historical origins of this criterion, known in the literature as a Greenburg square (Greenberg 1957). As Langer points out, important antecedents in the literature include Bloomfield's brief discussion (1933, 161) as well as Nida (1948, 1949).</Paragraph>
    <Paragraph position="6">  Computational Linguistics Volume 27, Number 2 should also appear with some other suffix; and if all stems that associate with these five suffixes appear with idiosyncratic suffixes (i.e., each different from the others), then the signature of those five suffixes would never emerge. In general, however, in a given corpus, a good proportion of stems appears with a complete set of what a grammarian would take to be the paradigmatic set of suffixes for its class: this will be neither the stems with the highest nor the stems with the lowest frequency, but those in between. In addition, there will be a large range of words with no acceptable morphological analysis, which is just as it should be: John, stomach, the, and so forth.</Paragraph>
    <Paragraph position="7"> To get a sense of what are identified as regular signatures in a language such as English, let us look at the results of a preliminary analysis in Table 2 of the 86,976 words of The Adventures of Tom Sawyer, by Mark Twain. The signatures in Table 2 are ordered by the breadth of a signature, defined as follows. A signature C/r has both a stem count (the number of stems associated with it) and an affix count (the number of affixes it contains), and we use log (stem count) ~ log (affix count) as a rough guide to the centrality of a signature in the corpus. The suffixes identified are given in Table 3 for the final analysis of this text.</Paragraph>
    <Paragraph position="8"> In this corpus of some 87,000 words, there are 202 regular signatures identified through the procedure we have outlined so far (that is, preceding the refining operations described in the next section), and 803 signatures composed entirely of regular suffixes (the 601 additional signatures either have only one suffix, or pertain to only a single stem).</Paragraph>
    <Paragraph position="9"> The top five signatures are: NULL.ed.ing, e.ed.ing, NULL.s, NULL.ed.s, and NULL.ed.ing.s; the third is primarily composed of noun stems (though it includes a few words from other categories--hundred, bleed, new), while the others are verb stems. Number 7, NULL.ly, identifies 105 words, of which all are adjectives (apprehensive, sumptuous, gay .... ) except for Sal, name, love, shape, and perhaps earth. The results in English are typical of the results in the other European languages that I have studied.</Paragraph>
    <Paragraph position="10"> These results, then, are derived by the application of the heuristics described above. The overall sketch of the morphology of the language is quite reasonable already in its outlines. Nevertheless, the results, when studied up close, show that there remain a good number of errors that must be uncovered using additional heuristics and evaluated using the MDL measure. These errors may be organized in the following ways:</Paragraph>
    <Paragraph position="12"> The collapsing of two suffixes into one: for example, we find the suffix ings here; in most corpora, the equally spurious suffix ments is found.</Paragraph>
    <Paragraph position="13"> The systematic inclusion of stem-final material into a set of (spurious) suffixes. In English, for example, the high frequency of stem-final ts can lead the system to analyze a set of suffixes as in the spurious signature ted.ting.ts, or ted.tion.</Paragraph>
    <Paragraph position="14"> The inclusion of spurious signatures, largely derived from short stems and short suffixes, and the related question of the extent of the inclusion of signatures based on real suffixes but overapplied. For example, s is a real suffix of English, but not every word ending in s should be analyzed as containing that suffix. On the other hand, every word ending in ness should be analyzed as containing that suffix (in this corpus, this reveals the stems: selfish, uneasi, wretched, loveli, unkind, cheeri, wakeful, drowsi, cleanli, outrageous, and loneli). In the initial analysis of Tom Sawyer, the stem ca is posited with the signature n.n't.p.red.st.t.</Paragraph>
    <Paragraph position="15">  The failure to break all words actually containing the same stem in a consistent fashion: for example, the stem abbreviate with the signature NULL.d.s is not related to abbreviat with the signature ing.</Paragraph>
    <Paragraph position="16"> Stems may be related in a language without being identical. The stem win may be identified as appearing with the signature NULL.s and the stem winn may be identified with the signature er.ing, but these stems should be related in the morphology.</Paragraph>
    <Paragraph position="17"> In the next section, we discuss some of the approaches we have taken to resolving these problems.</Paragraph>
    <Paragraph position="19"> st Signature NULL.ly.st, for stems ence such as safeen behold, deal weak, sunk, etc. ily le Error: analyzed le.ly for e.y (stems ward such as feeb-, audib-, simp-).</Paragraph>
    <Paragraph position="20"> al ation n't led nce Signature nce.nt, for stems fragr-, 'd dista-, indiffereent Spurious: triage problem (pot-ent) ry rious tion r rs ter triage problem ned k triage problem ning ful age ion h '11 te an triage problem ant ness r's nt see above ance novel, uncertain, six, proper triage problem error: stems such as glo- with signature rious.ry error: stems such as glo- with signature rious.ry error: r should be in stem awake-ned, white-ned, thin-ned begin-ning, run-ning triage problem should be -ate (e.g., punctua-te) triumph-ant, expect-ant error</Paragraph>
  </Section>
  <Section position="8" start_page="174" end_page="177" type="metho">
    <SectionTitle>
6. Optimizing Description Length Using Heuristics and MDL
</SectionTitle>
    <Paragraph position="0"> We can use the description length of the grammar formulated in (2) and (3) to evaluate any proposed revision, as we have already observed: note the description length of the grammar and the compressed corpus, perform a modification of the grammar, recompute the two lengths, and see if the modification improved the resulting description length. 25 25 This computation is rather lengthy, and in actual practice it may be preferable to replace it with far faster approaches to testing a change. One way to speed up the task is to compute the differential of the MDL function, so that we can directly compute the change in description length given some prior changes in the variables that define the morphology that are modified in the hypothetical change being evaluated (see the Appendix). The second way to speed up the task is, again, to use heuristics to identify clear cases for which full description length computation is not necessary, and to identify a smaller number of cases where fine description length is appropriate. For example, in the case  Goldsmith Unsupervised Learning of the Morphology of a Natural Language Following the morphological analysis of words described in the previous section, suffixes are checked to determine if they are spurious amalgams of independently motivated suffixes: ments is typically, but wrongly, analyzed as a suffix. Upon identification of such suffixes as spurious, the vocabulary containing these words is reanalyzed. For example, in Tom Sawyer, the suffix ings is split into ing and s, and thus the word beings is split into being plus s; the word being is, of course, already in the lexicon.</Paragraph>
    <Paragraph position="1"> The word breathings is similarly reanalyzed as breathing plus s, but the word breathing is not found in the lexicon; it is entered, with the morphological analysis breath+ing.</Paragraph>
    <Paragraph position="2"> Words that already existed include chafing, dripping, evening, feeling, and flogging, while new &amp;quot;virtual&amp;quot; words include belonging, bustling, chafing, and fastening. The only new word that arises that is worthy of notice is jing, derived from the word jings found in Twain's expression by jings! In a larger corpus of 500,000 words, 64 suffixes are tested for splitting, and 31 are split, including tions, ists, ians, ened, lines, ents, and ively. Note that what it means to say that &amp;quot;suffixes are checked to see if they are spurious amalgams&amp;quot; is that each suffix is checked to see if it is the concatenation of two independently existing suffixes, and then if that is the case, the entire description length of the corpus is recomputed under the alternative analysis; the reanalysis is adopted if and only if the description length decreases. The same holds for the other heuristics discussed immediately below. 26 Following this stage, the signatures are studied to determine if there is a consistent pattern in which all suffixes from the signature begin with the same letter or sequence of letters, as in te.ting.ts. 27 Such signatures are evaluated to determine if the description length improves when such a signature is modified to become e.ing.s, etc. It is necessary to precede this analysis by one in which all signatures are removed which consist of a single suffix composed of a single letter. This set of signatures includes, for example, the singleton signature e, which is a perfectly valid suffix in English; however, if we permit all words ending in e, but having no other related forms, to be analyzed as containing the suffix e, then the e will be inappropriately highly valued in the analysis.</Paragraph>
    <Paragraph position="3"> (We return to this question in Section 11, where we address the question of how many occurrences of a stem with a single suffix we would expect to find in a corpus.) In the next stage of analysis, triage, signatures containing a small number of stems or a single suffix are explored in greater detail. The challenge of triage is to determine when the data is rich and strong enough to support the existence of a linguistically real signature. A special case of this is the question of how many stems must exist to motivate the existence of a signature (and hence, a morphological analysis for the words in question) when the stems only appear with a single suffix. For example, if a set of words appear in English ending with hood, should the morphological analysis split the words in that fashion, even if the stems thereby created appear with no other suffixes? And, at the other extreme, what about a corpus which contains the words look, book, loot, and boot? Does that data motivate the signature l.k, for the stems boo and loo? The matter is rendered more complex by a number of factors. The length of the stems and suffixes in question clearly plays a role: suffixes of one letter are, all other things being equal, suspicious; the pair of stems Ioo and boo, appearing with the signature k.t, does not provide an example of a convincing mentioned in the text, that of determining whether a suffix such as ments should always be split into two independently motivated suffixes ment and s, we can compute the fraction of words ending in ments that correspond to freestanding words ending in ment. Empirical observation suggests that ratios over 0.5 should always be split into two suffixes, ratios under 0.3 should not be split, and those in between must be studied with more care.</Paragraph>
    <Paragraph position="4">  Computational Linguistics Volume 27, Number 2 linguistic pattern. On the other hand, if the suffix is long enough, even one stem may be enough to motivate a signature, especially if the suffix in question is otherwise quite frequent in the language. A single stem occurring with a single pair of suffixes may be a very convincing signature for other reasons as well. In Italian, for example, even in a relatively small corpus we are likely to find a signature such as a.ando.ano.are.ata.ate.ati.ato.azione.~ with several stems in it; once we are sure that the 10-suffix signature is correct, then the discovery of a subsignature along with a stem is perfectly natural, and we would not expect to find multiple stems associated with each of the occurring combinations. (A similar example in English from Tom Sawyer is NULL.ed.ful.ing.ive.less for the single stem rest.) And a signature may be &amp;quot;contaminated,&amp;quot; so to speak, by a spurious intruder. A corpus containing rag, rage, raged, raging, and rags gave rise to a signature: NULL.e.ed.ing.s for the stem rag. It seems clear that we need to use information that we have obtained regarding the larger, robust patterns of suffix combinations in the language to influence our decisions regarding smaller combinations. We return to the matter of triage below. null We are currently experimenting with methods to improve the identification of related stems. Current efforts yield interesting but inconclusive results. We compare all pairs of stems to determine whether they can be related by a simple substitution process (one letter for none, one letter for one letter, one letter for two letters), ignoring those pairs that are related by virtue of one being the stem of the other already within the analysis. We collect all such rules, and compare by frequency. In a 500,000-word English corpus, the top two such pairs of 1:1 relationships are (1) 46 stems related by a final d/s alternation, including intrud/intrus, apprendend/apprenhens, provid/provis, suspend/suspens, and elud/elus, and (2) 43 stems related by a final i/y alternation, including reli/rely, ordinari/ordinary, decri/decry, suppli/supply, and accompani/accompany. This approach can quickly locate patterns of allomorphy that are well known in the European languages (e.g., alternation between a and/~ in German, between o and ue in Spanish, between c and q in French). However, we do not currently have a satisfactory means of segregating meaningful cases, such as these, from the (typically less frequent and) spurious cases of stems whose forms are parallel but ultimately not related.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML