<?xml version="1.0" standalone="yes"?> <Paper uid="J01-2001"> <Title>Unsupervised Learning of the Morphology of a Natural Language</Title> <Section position="2" start_page="0" end_page="154" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> This is a report on the present results of a study on unsupervised acquisition of morphology. 1 The central task of morphological analysis is the segmentation of words into the components that form the word by the operation of concatenation.</Paragraph> <Paragraph position="1"> While that view is not free of controversy, it remains the traditional conception of morphology, and the one that we shall employ here. 2 Issues of interface with phonology, traditionally known as morphophonology, and with syntax are not directly addressed. 3 While some of the discussion is relevant to the unrestricted set of languages, some of the assumptions made in the implementation restrict the useful application of the algorithms to languages in which the average number of affixes per word is less than what is found in such languages as Finnish, Hungarian, and Swahili, and we restrict our testing in the present report to more widely studied European languages. Our general goal, however, is the treatment of unrestricted natural languages.</Paragraph> <Paragraph position="2"> 1998, and I am grateful for the support I received there. A first version was written in September, 1998, and a much-revised version was completed in December, 1999. This work was also supported in part by a grant from the Argonne National Laboratory-University of Chicago consortium, which I thank for its support. I am also grateful for helpful discussion of this material with a number of people, including Carl de Marcken, Jason Eisner, Zhiyi Chi, Derrick Higgins, Jorma Rissanen, Janos Simon, Svetlana Soglasnova, Hisami Suzuki, and Jessie Pinkham. 
As noted below, I owe a great deal to the remarkable work reported in de Marcken's dissertation, without which I would not have undertaken the work described here. I am grateful as well to several anonymous reviewers for their considerable improvements to the content of this paper.</Paragraph> <Paragraph position="3"> 2 Sylvain Neuvel has recently produced an interesting computational implementation of a theory of morphology that does not have a place for morphemes, as described at http://www.neuvel.net. It is well established that nonconcatenative morphology is found in some scattered language families, notably Semitic and Penutian. African tone languages require simultaneous morphological analyses of the tonal and the segmental material.</Paragraph> <Paragraph position="4"> 3 But see the following note.</Paragraph> <Paragraph position="5"> © 2001 Association for Computational Linguistics. Computational Linguistics, Volume 27, Number 2. The program in question takes a text file as its input (typically in the range of 5,000 to 1,000,000 words) and produces a partial morphological analysis of most of the words of the corpus; the goal is to produce an output that matches as closely as possible the analysis that would be given by a human morphologist. It performs unsupervised learning in the sense that the program's sole input is the corpus; we provide the program with the tools to analyze, but no dictionary and no morphological rules particular to any specific language. At present, the goal of the program is restricted to providing the correct analysis of words into component pieces (morphemes), though with only a rudimentary categorical labeling.</Paragraph> <Paragraph position="6"> The underlying model that is utilized invokes the principles of the minimum description length (MDL) framework (Rissanen 1989), which provides a helpful perspective for understanding the goals of traditional linguistic analysis. 
MDL focuses on the analysis of a corpus of data that is optimal by virtue of providing both the most compact representation of the data and the most compact means of extracting that compression from the original data. It thus requires both a quantitative account whose parameters match the original corpus reasonably well (in order to provide the basis for a satisfactory compression) and a spare, elegant account of the overall structure.</Paragraph> <Paragraph position="7"> The novelty of the present account lies in the use of simple statements of morphological patterns (called signatures below), which aid both in quantifying the MDL account and in constructively building a satisfactory morphological grammar (for MDL offers no guidance in the task of seeking the optimal analysis). In addition, the system whose development is described here sets reasonably high goals: the reformulation in algorithmic terms of the strategies of analysis used by traditional morphologists.</Paragraph> <Paragraph position="8"> Developing an unsupervised learner using raw text data as its sole input offers several attractive aspects, both theoretical and practical. At its most theoretical, unsupervised learning constitutes a (partial) linguistic theory, producing a completely explicit relationship between data and analysis of that data. A tradition of considerable age in linguistic theory sees the ultimate justification of an analysis A of any single language L as residing in the possibility of demonstrating that analysis A derives from a particular linguistic theory LT, and that that LT works properly across a range of languages (not just for language L). 
There can be no better way to make the case that a particular analysis derives from a particular theory than to automate that process, so that all the linguist has to do is to develop the theory-as-computer-algorithm; the application of the theory to a particular language is carried out with no surreptitious help.</Paragraph> <Paragraph position="9"> From a practical point of view, the development of a fully automated morphology generator would be of considerable interest, since we still need good morphologies of many European languages, and producing a morphology of a given language &quot;by hand&quot; can take weeks or months. With the advent of considerable historical text available on-line (such as the ARTFL database of historical French), it is of great interest to develop morphologies of particular stages of a language, and the process of automatic morphology writing can considerably simplify this stage, where there are no native speakers available.</Paragraph> <Paragraph position="10"> A third motivation for this project is that it can serve as an excellent preparatory phase (in other words, a bootstrapping phase) for an unsupervised grammar acquisition system. As we will see, a significant proportion of the words in a large corpus can be assigned to categories, though the labels that are assigned by the morphological analysis are corpus internal; nonetheless, the assignment of words into distinct morphologically motivated categories can be of great service to a syntax acquisition device.</Paragraph> <Paragraph position="11"> Table 1. Some signatures from Tom Sawyer.</Paragraph> <Paragraph position="12">
Signature       Example                             Stem Count (type)   Token Count
NULL.ed.ing     betray betrayed betraying           69                  864
NULL.ed.ing.s   remain remained remaining remains   14                  516
NULL.s          cow cows                            253                 3,414
e.ed.es.ing     notice noticed notices noticing     4                   62

The problem, then, involves both the determination of the correct morphological split for individual words, and the establishment of accurate categories of stems based on the range of suffixes that they accept:</Paragraph> <Paragraph position="14"> Splitting words: We wish to accurately analyze any word into successive morphemes in a fashion that corresponds to the traditional linguistic analysis. Minimally, we wish to identify the stem, as opposed to any inflectional suffixes. Ideally we would also like to identify all the inflectional suffixes on a word which contains a stem that is followed by two or more inflectional suffixes, and we would like to identify derivational prefixes and suffixes. We want to be told that in this corpus, the most important suffixes are -s, -ing, -ed, and so forth, while in the next corpus, the most important suffixes are -e, -en, -heit, -ig, and so on.</Paragraph> <Paragraph position="15"> Of course, the program is not a language identification program, so it will not name the first as &quot;English&quot; and the second as &quot;German&quot; (that is a far easier task), but it will perform the task of deciding for each word what is stem and what is affix.</Paragraph> <Paragraph position="16"> Range of suffixes: The most salient characteristic of a stem in the languages that we will consider here is the range of suffixes with which it can appear. Adjectives in English, for example, will appear with some subset of the suffixes -er, -est, -ity, -ness, etc. We would like to determine automatically what the range of the most regular suffix groups is for the language in question, and rank suffix groupings by order of frequency in the corpus. 
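The grouping and ranking of suffix sets just described can be illustrated with a minimal Python sketch. It rests on strong simplifying assumptions and is not the code of the program reported here: the stem/suffix splits are taken as given (finding them is the hard part, which this sketch does not attempt), ranking is by number of stems (one possible reading of "frequency"), and the names SPLITS, signature, and rank_signatures are hypothetical. The output uses the dot-joined notation of the Tom Sawyer table above, with NULL standing for the bare stem.

```python
from collections import defaultdict

# Hypothetical toy input: (stem, suffix) pairs assumed to be already known.
# The empty string marks the bare stem, written NULL in the table above.
SPLITS = [
    ("betray", ""), ("betray", "ed"), ("betray", "ing"),
    ("remain", ""), ("remain", "ed"), ("remain", "ing"), ("remain", "s"),
    ("cow", ""), ("cow", "s"),
    ("dog", ""), ("dog", "s"),
    ("notic", "e"), ("notic", "ed"), ("notic", "es"), ("notic", "ing"),
]


def signature(suffixes):
    """Join a stem's suffixes in sorted order, writing NULL for the empty suffix.

    Plain string sorting puts the uppercase NULL before lowercase suffixes.
    """
    names = sorted("NULL" if s == "" else s for s in suffixes)
    return ".".join(names)


def rank_signatures(splits):
    """Group stems by signature and rank signatures by their number of stems."""
    suffixes_by_stem = defaultdict(set)
    for stem, suffix in splits:
        suffixes_by_stem[stem].add(suffix)

    stems_by_signature = defaultdict(list)
    for stem, suffixes in suffixes_by_stem.items():
        stems_by_signature[signature(suffixes)].append(stem)

    # Most stems first; ties are broken alphabetically by signature.
    return sorted(
        stems_by_signature.items(),
        key=lambda item: (-len(item[1]), item[0]),
    )


if __name__ == "__main__":
    for sig, stems in rank_signatures(SPLITS):
        print(sig, len(stems), sorted(stems))
```

On this toy input, NULL.s ranks first because two stems (cow, dog) share it; the three remaining signatures each have a single stem and follow in alphabetical order. Ranking by token count instead would only require summing corpus frequencies per signature in place of counting stems.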
4 To give a sense of the results of the program, consider one aspect of its analysis of the novel The Adventures of Tom Sawyer; this result is, by and large, consistent regardless of the corpus one chooses. Consider the top-ranked signatures, illustrated in Table 1: a signature is an alphabetized list of affixes that appear with a particular stem in a corpus. (A larger list of these patterns of suffixation in English is given in Table 2, in Section 5.) The present morphology learning algorithm is contained in a C++ program called Linguistica that runs on a desktop PC and takes a text file as its input. 5 Analyzing a</Paragraph> </Section></Paper>