File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/86/p86-1019_metho.xml
Size: 22,193 bytes
Last Modified: 2025-10-06 14:11:55
<?xml version="1.0" standalone="yes"?> <Paper uid="P86-1019"> <Title>COMPUTER METHODS FOR MORPHOLOGICAL ANALYSIS</Title> <Section position="3" start_page="120" end_page="121" type="metho"> <SectionTitle> 2. Approach and Tools </SectionTitle> <Paragraph position="0"> Our approach to computer-aided morphological research is to analyse a large number of English words in terms of a somewhat smaller list of monomorphemic base words. For each morphologically complex word on the original list which can be analysed down to one of our bases, we obtain a structure which shows the affixes and marks the parts-of-speech of the components.</Paragraph> <Paragraph position="1"> Thus, for beautification, we obtain the structure <<<beauty>N +ify>V +ion>N.</Paragraph> <Paragraph position="2"> In this structure, the noun beauty is the ultimate base and +ify and +ion are the affixes.</Paragraph> <Paragraph position="3"> After analysis, we obtain, for each base, a list of all words derived from it, together with their morphological structures. We then study these lists and the patterns of affixation they exemplify, seeking generalizations.</Paragraph> <Paragraph position="4"> Section 3 will give an expanded description of the approach together with a detailed account of one of the studies.</Paragraph> <Paragraph position="5"> We have two classes of tools: word lists and computer programs. There are basically four word lists.</Paragraph> <Paragraph position="6"> 1. The Kucera and Francis (K&F) word list, from Kucera and Francis (1967), contains 50,000 words listed in order of frequency of occurrence.</Paragraph> <Paragraph position="7"> 2. The BASE WORD LIST consists of approximately 3,000 monomorphemic words. It was drawn from the top of the K&F list by the GETBASES procedure described below.</Paragraph> <Paragraph position="8"> 3. The UDICT word list consists of about 63,000 words, drawn mainly from Merriam (1963). 
The UDICT program, described below, uses this list in conjunction with our word grammar to produce morphological analyses of input words. The UDICT word list is a superset of the base word list; for each word, it contains the major category as well as other grammatical information.</Paragraph> <Paragraph position="9"> 4. The &quot;complete&quot; word list consists of approximately one quarter million words drawn from an international-sized dictionary. Each entry on this list is a single orthographic word, with no additional information. These are the words which are morphologically analysed down to the bases on our base list.</Paragraph> <Paragraph position="10"> 5. We have prepared reverse spelling word lists based on each of the other lists. A particularly useful tool has been a group of reverse lists derived from Merriam (1963) and separated by major category.</Paragraph> <Paragraph position="11"> These lists provide ready access to sets of words having the same suffix.</Paragraph> <Paragraph position="12"> Our computer programs include the following.</Paragraph> <Paragraph position="13"> 1. UDICT. This is a general-purpose dictionary access system intended for use by computer programs.</Paragraph> <Paragraph position="14"> (The UDICT program was originally developed for the EPISTLE text-critiquing system, as described in Heidorn, et al. (1982).) It contains, among other things, the morphological analysis logic and the word grammar that we use to produce the word structures previously described.</Paragraph> <Paragraph position="15"> 2. GETBASES. This program produces a list of monomorphemic words from the original K&F frequency lists. Basically, it operates by invoking UDICT for each word. The output consists of words which are morphologically simple, and the bases of morphologically complex words. (Among other things, this allows us to handle the fact that the original K&F lists are not lemmatised.) 
The resulting list, with duplicates removed, is our &quot;base list&quot;.</Paragraph> <Paragraph position="16"> 3. ANALYSE. ANALYSE takes each entry from the complete word list. It invokes the UDICT program to give a morphological analysis for that word. Any word whose ultimate base is in the base list is considered a derived word. For each word from the base list, the final result is a list of pairs consisting of [derived-word, structure]. The data produced by ANALYSE is further processed by the next four programs.</Paragraph> <Paragraph position="17"> 4. ANALYSES. This program allows us to inspect the set of [derived-word, structure] pairs associated with any word in the base list. For example, its output for the word beauty is shown in Figure 1. 5. SASDS. This program produces matrices indicating which bases take which single affixes to form another word. One matrix is produced for each of the major categories: nouns, adjectives, and verbs. More detail on the contents and use of these matrices is given in Section 3.</Paragraph> <Paragraph position="18"> 6. MORPH. This program uses the matrices created by SASDS to list bases that accept one or more given affixes.</Paragraph> <Paragraph position="19"> 7. SAS. (SAS is a trademark of the SAS Institute, Inc., Cary, North Carolina.) This is a set of statistical analysis programs which can be used to analyse the matrices produced by SASDS.</Paragraph> <Paragraph position="20"> 8. WordSmith. This is an on-line dictionary system, developed at IBM, that provides fast and convenient reference to a variety of types of dictionary information. The WordSmith functions of most use in our current research are the REVERSE dimension (for listing words that end the same way), the WEBSTER7 application (for checking the definitions of words we don't know), and the UDED application (for checking and revising the contents of the UDICT word list).</Paragraph> </Section> <Section position="4" start_page="121" end_page="123" type="metho"> <SectionTitle> 3. 
Detailed Methods </SectionTitle> <Paragraph position="0"> Our research can be conveniently described as a two-stage process. During the first stage, we endeavored to produce a list of morphologically active base words from which other English words can be derived by affixation.</Paragraph> <Paragraph position="1"> The term &quot;morphologically active&quot; means that a word can potentially serve as the base of a large number of affixed derivatives. Having such words is important for stage two, where patterns of affixation become more obvious when we have more instances of bases that exhibit them. We conjectured that words which are frequent in the language have a higher likelihood of participating in word-formation processes, so we began our search with the 6,000 most frequent words in the K&F word list.</Paragraph> <Paragraph position="2"> The GETBASES program segregated these words into two categories: morphologically simple words (i.e., those for which UDICT produced a structure containing no affixes) and morphologically complex words. At the same time, GETBASES discarded words that were not morphologically interesting; these included proper nouns, words not belonging to the major categories, and non-lemma forms of irregular words. (For example, the past participle done does not take affixes, although its lemma do will accept #able as in doable.) GETBASES next considered the ultimate bases of the morphologically complex words. Any base which did not also appear in the K&F word list was discarded. The remaining bases were added to the original list of morphologically simple words. After removing duplicates, we obtained a list of approximately 3,000 very frequent bases which we conjectured were morphologically active.</Paragraph> <Paragraph position="3"> Development of the GETBASES program was an iterative process. The primary type of change made at each iteration was to correct and improve the UDICT grammar and morphological analysis mechanism. 
Because the constraints on the output of GETBASES were clear (and because it was obvious when we failed to meet them), the creation of GETBASES proved to be a very effective way to guide improvements to UDICT. The more important of these improvements are discussed in Section 4.3.</Paragraph> <Paragraph position="4"> For stage two of our project, we used ANALYSE to process the &quot;complete&quot; word list, as described in Section 2. That is, for each word, UDICT was asked to produce a morphological analysis. Whenever the ultimate base for one of the (morphologically complex) words appeared on our list of 3,000 bases, the derived word and its structure were added to the list of such pairs associated with that base. ANALYSE yielded, therefore, a list of 3,000 sublists of [word, structure] pairs, with each sublist named by one of our base words. We called this result BASELIST.</Paragraph> <Paragraph position="5"> Our first in-depth study of this material involved the process of adding a single affix to a base word to form another word. By applying SASDS to BASELIST, we obtained three matrices showing, for each base, which affixes it did and did not accept. The noun matrix contained 1900 bases; the adjective matrix contained 850 bases; and the verb matrix contained 1600 bases. (Since the original list of bases contained words belonging to multiple major categories, these counts add up to more than 3,000. The ANALYSE program used the part-of-speech assignments from UDICT to disambiguate such homographs.) Figure 2 contains samples taken from the noun, adjective, and verb matrices. For each matrix, the horizontal axis shows the complete list of affixes (for that part-of-speech) covered in our study. The vertical axes give contiguous samples of our ultimate bases.</Paragraph> <Paragraph position="6"> Our results are by no means perfect. Some of our misanalyses come about because of missing constraints in our grammar. 
The process of correcting these errors is discussed in Section 4. Sometimes there are genuine ambiguities, as with the words refuse (<re# <fuse>V>V) and preserve (<pre# <serve>V>V). In the absence of information about how an input word is pronounced or what it means, it is difficult to imagine how our analyser can avoid producing the structures shown.</Paragraph> <Paragraph position="7"> Some of our problems are caused by the fact that the complete word list is alternately too large and not large enough. It includes the word artal (plural of rotl, a Middle Eastern unit of weight), which our rules dutifully, if incorrectly, analyse as <<art>N +al >A. Yet it fails to include angelhood, even though angel bears the [+human] feature that #hood seems to require.</Paragraph> <Paragraph position="8"> Despite such errors, however, most of the analyses in these matrices are correct and provide a useful basis for our analytical work. We employed a variety of techniques to examine these matrices and the BASELIST.</Paragraph> <Paragraph position="9"> Our primary approach was to use SAS, MORPH, and ANALYSES to suggest hypotheses about affix attachment. We then used MORPH, WordSmith, and UDICT (via changes to the grammar) to test and verify those hypotheses. Hypotheses which have so far survived our tests and our skepticism are given in Section 4.</Paragraph> </Section> <Section position="5" start_page="123" end_page="124" type="metho"> <SectionTitle> 4. Results </SectionTitle> <Paragraph position="0"> Using the methods described, we have produced results which enhance our understanding of morphological processes, and have produced improvements in the morphological analysis system. We present here some of what we have already learned. 
Continued research using our approach and data will yield further results.</Paragraph> <Section position="1" start_page="123" end_page="124" type="sub_section"> <SectionTitle> 4.1 Methodological Results </SectionTitle> <Paragraph position="0"> It is significant that we were able to perform this research with generally available materials. With the exception of the K&F word frequency list, our word lists were obtained from commercially available dictionaries.</Paragraph> <Paragraph position="1"> This work forms a natural accompaniment to another Lexical Systems project, reported in Chodorow, et al.</Paragraph> <Paragraph position="2"> (1985), in which semantic information is extracted from commercial dictionaries. As the morphology project identifies lexical information that is relevant, variations of the semantic extraction methods may be used to populate the dictionary with that information.</Paragraph> <Paragraph position="3"> As has already been pointed out, our rules leave a residue of mis-analysed words, which shows up (for example) as errors in our matrices. Although we can never eliminate this residue, we can reduce its size by introducing additional constraints into our grammar as we discover them. For example, chicken was mis-analysed as <<chic>A +en>V. As we show in greater detail below, we now know that the +en suffix requires a [+Germanic] base; since chic is [-Germanic], we can avoid the mis-analysis. Similarly, we can avoid analysing legal as <<leg>N +al>A by observing that +al requires a [-Germanic] base while leg is [+Germanic]. Finally, we now have several ways to avoid the mis-analysis of maize as <<ma>N +ize>V, including the observation that +ize does not accept monosyllabic bases. We don't expect, however, to find a constraint that will deal correctly with words like artal.</Paragraph> <Paragraph position="4"> In the introduction, we pointed out that one of our goals was to build a system which can handle coinages. 
With respect to the 63,000-word UDICT word list, the quarter-million-word complete word list can be viewed as consisting mostly of coinages. The fact that our analyser has been largely successful at analysing the words on the complete word list means that we are close to meeting our goal. What remains is to exploit our research results in order to reduce our mis-analysed residue as much as possible.</Paragraph> </Section> </Section> <Section position="6" start_page="124" end_page="125" type="metho"> <SectionTitle> 4.2 Linguistic Results </SectionTitle> <Paragraph position="0"> Linguistically significant generalizations that have resulted so far can be encoded in the form of conditions and assertions in our word formation rule grammar (see Byrd (1983a)). They typically constrain interactions between specific affixes and particular groups of words.</Paragraph> <Paragraph position="1"> The linguistic constraints fall into at least three categories: (1) syllabic structure of the base word; (2) phonemic nature of the final segment of the base word; and (3) etymology of the base word, both derived and underived. Each of these is covered below. Some of these constraints have been informally observed by other researchers, but some have not.</Paragraph> <Paragraph position="2"> Constraints on the Syllabic structure of the base word. It is commonly known that the length of a base word can affect an inflectional process such as comparative formation in English. One can distinguish between short and long words where [+short] indicates two or fewer syllables and [+long] indicates two or more syllables.</Paragraph> <Paragraph position="3"> For example, a word such as big which is [+short] can take the affixes -er and -est. In contrast, words which are [-short] cannot, cf. possible, *possibler, *possiblest. (There are additional constraints on comparative formation, which we will not go into here. We give here only the simplified version.) 
We have found that other suffixes appear to require the feature [+short]. For example, nouns that take the suffix #ish tend to be [+short]. The actual results of our analysis show that no words of four syllables took #ish and only seven words of three syllables took #ish. In contrast, a total of 221 one- and two-syllable words took this suffix. The suffix thus preferred one-syllable words over two-syllable words by a factor of four (178 one-syllable words over 43 two-syllable words). Compare boy/boyish with mimeograph/mimeographish. This is not to say that a word like mimeographish is necessarily ill-formed, but that it is less likely to occur, and in fact did not occur in a list like Merriam (1963).</Paragraph> <Paragraph position="4"> Two other suffixes also appear to select for number of syllables in the base word. In this case the denominal verb suffixes +ize and +ify are nearly in complementary distribution. Our data show that of the approximately 200 bases which take +ize, only seven are monosyllabic.</Paragraph> <Paragraph position="5"> Compare this with the suffix +ify, which selects for about 100 bases, of which only one is trisyllabic and 17 are disyllabic. Thus, +ify tends to select for [+short] bases while +ize tends to select for [+long] ones. As with #ish, there appears to be motivation for syllabic structure constraints on morphological rules.</Paragraph> <Paragraph position="6"> In the case of +ize and +ify it appears that the syllabic structure of the suffix interacts with the syllabic structure of the base. Informally, the longer suffix selects for a [+short] base, and the shorter suffix selects for a [+long] base. Our speculation is that this may be related to the notion of optimal target metrical structure as discussed in Hayes (1984). This notion, however, is the subject of future research.</Paragraph> <Paragraph position="7"> The Final Segment of the Base Word. 
The phonemic nature of the final segment appears to affect the propensity of a base to take an affix. Consider the fact that there occurred some 48 +ary adjectives derived from nouns in our data. Of these, 46 are formed from bases ending with alveolars. The category alveolar includes the phonemes /t/, /d/, /n/, /s/, /z/, and /l/. The two exceptions are customary and palmary. Again, in a word recognizer, if a base does not end in one of these phonemes, then it is not likely to be able to serve as the base of +ary. We have also found that the ual spelling of the +al suffix prefers a preceding alveolar, such as gradual, sexual, habitual.</Paragraph> <Paragraph position="8"> Another result related to the alveolar requirement is an even more stringent requirement of the nominalizing suffix +ity. Of the approximately 150 nouns taking +ity, only three end in the phoneme /t/ (chastity, sacrosanctity, and vastity). In addition the adjectivizer +cy seems also to attach primarily to bases ending in /t/. The exceptions are normalcy and supremacy.</Paragraph> <Paragraph position="9"> Etymology of the Base Word. The feature [+Germanic] is said to be of critical importance in the analysis of English morphology (Chomsky and Halle 1968, Marchand 1969). In two cases our data show this to be true. The suffix +en, which creates verbs from adjectives, as in moist/moisten, yielded a total of fifty-five correct analyses. Of these, forty-three appear in Merriam (1963), and of these forty-one are of Germanic origin. The remaining two are quieten and neaten. The former is found only in some dialects. It is clear that +en verbs are overwhelmingly formed on [+Germanic] bases.</Paragraph> <Paragraph position="10"> The feature [Germanic] is also significant with +al adjectives. In contrast to the +en suffix, +al selects for the feature [-Germanic]. In our data, there were some two hundred and seventy-two words analysed as adjectives derived from nouns by +al suffixation. 
Of the base words which appear in Merriam (1963), only one, bridal, is of Germanic origin. However, interestingly, it turns out that the analysis <<bride>N +al >A is spurious, since bridal is the reflex of an Old English form brydealu, a noun referring to the wedding feast. The adjective bridal is not derived from bride. Rather it was zero-derived historically from the nominal form.</Paragraph> <Paragraph position="11"> Finally, other findings from our analysis show that no words formed with the Anglo-Saxon prefixes a+, be+ or for+ will negate with the Latinate prefixes non# or in#. This supports the findings of Marchand (1969).</Paragraph> <Paragraph position="12"> Observe that in these examples, the constraint applies between affixes, rather than between an affix and a base. The addition of an affix thus creates a new complex lexical item, complete with additional properties which can constrain further affixation.</Paragraph> <Paragraph position="13"> In sum, our sample findings suggest a number of new constraints on morphological rules. In addition we provide evidence and support for the observations of others.</Paragraph> <Section position="1" start_page="125" end_page="125" type="sub_section"> <SectionTitle> 4.3 Improvements to the Implementation </SectionTitle> <Paragraph position="0"> In addition to using our linguistic results to change the grammar, we have also made a variety of improvements to UDICT's morphological analyser which interprets that grammar. Some have been for our own convenience, such as streamlining the procedures for changing and compiling the grammar. Two of the improvements, however, result directly from the analysis of our word lists and files. These improvements represent generalizations over classes of affixes.</Paragraph> <Paragraph position="1"> First, we observed that, with the exception of be, do, and go, no base spelled with fewer than three characters ever takes an affix. 
Adding code to the analyser to restrict the size of bases has had an important effect in avoiding spurious analyses.</Paragraph> <Paragraph position="2"> A more substantial result is that we have added to UDICT a comprehensive set of English spelling rules which make the right spelling adjustments to the base of a suffix virtually all of the time. These rules, for example, know when and when not to double final consonants, when to retain silent e preceding a suffix beginning with a vowel, and when to add k to a base ending in c. These rules are a critical aspect of UDICT's ability to robustly handle normal English input and to avoid misanalyses.</Paragraph> </Section> </Section> </Paper>