<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-2001">
  <Title>Unsupervised Learning of the Morphology of a Natural Language</Title>
  <Section position="9" start_page="177" end_page="185" type="concl">
    <SectionTitle>
7. Results
</SectionTitle>
    <Paragraph position="0"> On the whole, the inclusion of the strategies described in the preceding sections leads to very good, but by no means perfect, results. In this section we shall review some of these results qualitatively, some quantitatively, and discuss briefly the origin of the incorrect parses.</Paragraph>
    <Paragraph position="1"> We obtain the most striking result by looking at the top list of signatures in a language, if we have some familiarity with the language: it is almost as if the textbook patterns have been ripped out and placed in a chart. As these examples suggest, the large morphological patterns identified tend to be quite accurately depicted. To illustrate the results on European languages, we include signatures found from a 500,000-word corpus of English (Table 4), a 350,000-word corpus of French (Table 5), Don Quijote, which contains 124,716 words of Spanish (Table 6), a 125,000-word corpus of Latin (Table 7), and 100,000 words and 1,000,000 words of Italian (Tables 8 and 9).</Paragraph>
    <Paragraph position="2"> The 500,000-word (token-count) corpus of English (the first part of the Brown Corpus) contains slightly more than 30,000 distinct words.</Paragraph>
    <Paragraph position="3"> To illustrate the difference of scale that is observed depending on the size of the corpus, compare the signatures obtained in Italian on a corpus of 100,000 words (Table 8) and a corpus of 1,000,000 words (Table 9). When one sees the rich inflectional pattern emerging, as with the example of the 10 suffixes on first-conjugation stems (a.ando.ano.are.ata.ate.ati.ato.azione.~), one cannot but be struck by the grammatical detail that is emerging from the study of a larger corpus. [Footnote 28: Signature 1 is formed from adjectival stems in the fem. sg., fem. pl., masc. pl., and masc. sg. forms; Signature 2 is entirely parallel, based on stems ending with the morpheme -ic/-ich, where ich is used before i and e. Signature 4 is an extension of Signature 2, including nominalized (sg. and pl.) forms. Signature 5 is the large regular verb inflection pattern (seven such verb stems are identified). Signature 3 is a subset of Signature 1, composed of stems accidentally not found in the feminine plural form. Signatures 6 and 8 are primarily masculine nouns (sg. and pl.), Signature 10 is feminine nouns (sg. and pl.), and the remaining Signatures 7 and 9 are again subsets of the regular adjective pattern of Signature 1.] Turning to French, we may briefly inspect the top 10 signatures that we find in a 350,000-word corpus in Table 5. It is instructive to consider the signature a.aient.ait.ant.e.ent.er.es.èrent.é.ée.és, which is ranked ninth among signatures. It contains a large part of the suffixal pattern from the most common regular conjugation, the first conjugation. Within the scope of the effort covered by this project, the large-scale generalizations extracted about these languages appear to be quite accurate (leaving for further discussion below the questions of how to link related signatures and related stems). It is equally important to take a finer-grained look at the results and quantify them. To do this, we have selected from the English and the French analyses a set of 1,000 consecutive words in the alphabetical list of words from the corpus and divided them into distinct sets according to the analysis provided by the present algorithm. See Tables 10 and 11.</Paragraph>
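    The dotted lists above (a.ando.ano.are..., and so on) are signatures: each is the set of suffixes shared by a group of stems. As a minimal, purely illustrative sketch of this bookkeeping (the function and the toy data below are invented for exposition, not taken from the implementation described in this paper), candidate stem/suffix splits can be grouped into signatures as follows:

        # Illustrative only: group stems by the exact set of suffixes they take; the
        # dotted form of that set is the signature (NULL marks the empty suffix).
        from collections import defaultdict

        def signatures(splits):
            """splits: iterable of (stem, suffix) pairs; '' stands for the NULL suffix."""
            suffixes_of = defaultdict(set)
            for stem, suffix in splits:
                suffixes_of[stem].add(suffix if suffix else "NULL")
            sigs = defaultdict(list)
            for stem, sufs in suffixes_of.items():
                sigs[".".join(sorted(sufs))].append(stem)
            return sigs

        # Toy, invented Italian-like data:
        toy = [("parl", "a"), ("parl", "ando"), ("parl", "are"), ("parl", "ato"),
               ("am", "a"), ("am", "ando"), ("am", "are"), ("am", "ato")]
        print(dict(signatures(toy)))   # {'a.ando.are.ato': ['parl', 'am']}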
    <Paragraph position="4"> The first category of analyses, labeled Good, is self-explanatory in the case of most words (e.g., proceed, proceeded, proceeding, proceeds), and many of the errors are equally easy to identify by eye (abide with no analysis, next to abid-e and abid-ing, or Abn-er). Quite honestly, I was surprised by how many words there were for which it was difficult to say what the correct analysis was. For example, consider the pair aboli-tion and abolish. The words are clearly related, and abolition clearly has a suffix; but does it have the suffix -ion, -tion, or -ition, and does abolish have the suffix -ish or -sh? It is hard to say. In a case of this sort, my policy for assigning success or failure has been influenced by two criteria. The first is that analyses are better insofar as they explicitly relate words that are appropriately parallel in semantics, as in the abolish~abolition case; thus I would give credit to either the analysis aboli-tion/aboli-sh or the analysis abol-ition/abol-ish. The second criterion is a bit more subtle. Consider the pair of words alumnus and alumni. Should these be morphologically analyzed in a corpus of English, or rather, should failure to analyze them be penalized for this morphology algorithm? (Compare in like manner alibi or allegretti; do these English words contain suffixes?) My principle has been that if I would have given the system additional credit for discovering that relationship, I have penalized it for not discovering it; that is a relatively harsh criterion to apply, to be sure. Should proper names be morphologically analyzed? The answer is often unclear. In the 500,000-word English corpus, we encounter Alex and Alexis, and the latter is analyzed as alex-is. I have scored this as correct, much as I have scored as correct the analyses of Alexand-er and Alexand-re. On the other hand, the failure to analyze Alexeyeva despite the presence of Alex and Alexei does not seem to me to be an error; the analysis Anab-el has been scored as an error, but John-son (and a bit less obviously Wat-son) have not been treated as errors. [29] Difficult to classify, too, is the treatment of words such as abet~abetted~abetting. The present algorithm selects the uniform stem abet in that case, assigning the signature NULL.ted.ting. Ultimately what we would like to have is a means of indicating that the doubled t is predictable, and that the correct signature is NULL.ed.ing. At present this is not implemented, and I have chosen to mark this as correct, on the grounds that it is more important to identify words with the same stem than to identify the (in some sense) correct signature. Still, unclear cases remain: for example, consider the words accompani-ed/accompani-ment/accompani-st. The word accompany does not appear as such, but the stem accompany is identified in the word accompany-ing. The analysis accompani-st fails to identify the suffix -ist, but it will successfully identify the stem as being the same as the one found in accompanied and accompaniment, which it would not have done if it had associated the i with the suffix. I have, in any event, marked this analysis as wrong, but without much conviction behind the decision.
Similarly, the analysis of the French putative stem embelli with the suffixes e/rent/t passes the weak test of treating related words with the same stem, but I have counted it as an error, on the grounds that the analysis is unquestionably one letter off from the correct, traditional analysis of second-conjugation verbs. This points to a more general issue regarding French morphology, which is more complex than that of English. The infinitive écrire 'to write' would ideally be analyzed as a stem écr plus a derivational suffix i followed by an infinitival suffix re. Since the derivational suffix i occurs in all its inflected forms, it is not unreasonable to find an analysis in which the i is integrated into the stem itself. This is what the algorithm does, employing the stem écri for the words écri-re and écri-t. Écrit in turn is the stem for écrite, écrites, écrits, and écriture. An alternate stem form écriv is used for past tense forms (and the nominalization écrivain) with the suffixes aient, ait, ant, irent, it. The algorithm does not make explicit the connection between these two stems, as it ideally would.</Paragraph>
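    The abet~abetted~abetting and embelli cases just discussed both come down to where the stem/suffix cut falls when a single stem is made to serve a whole family of words. A rough sketch of that idea, using the longest common prefix as the uniform stem (an expository simplification, not the algorithm of this paper):

        # Expository simplification: take the longest common prefix of a word family as
        # the uniform stem and read the leftover material off as the signature.
        import os

        def split_family(words):
            stem = os.path.commonprefix(words)
            sig = ".".join(sorted((w[len(stem):] or "NULL") for w in words))
            return stem, sig

        print(split_family(["abet", "abetted", "abetting"]))           # ('abet', 'NULL.ted.ting')
        print(split_family(["embellie", "embellirent", "embellit"]))   # ('embelli', 'e.rent.t')

        # A doubling-aware variant would treat the repeated t of abetted/abetting as
        # predictable and report NULL.ed.ing instead; as noted above, that is not
        # implemented here.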
    <Paragraph position="5"> Thus in the tables, Good indicates the category of words where the analysis was clearly right, while the incorrect analyses have been broken into several categories.</Paragraph>
    <Paragraph position="6"> Wrong Analysis is for bimorphemic words that are analyzed, but incorrectly analyzed, by the algorithm. Failed to Analyze are the cases of words that are bimorphemic but for which no analysis was provided by the algorithm, and Spurious Analysis are the cases of words that are not morphologically complex but were analyzed as containing a suffix.</Paragraph>
    <Paragraph position="7"> [Footnote 29: My inability to determine the correct morphological analysis in a wide range of words that I know perfectly well seems to me to be essentially the same response as has often been observed in the case of speakers of Japanese, Chinese, and Korean when forced to place word boundaries in e-mail romanizations of their language. Ultimately the quality of a morphological analysis must be measured by how well the algorithm handles the clear cases, how well it displays the relationships between words perceived to be related, and how well it serves as the language model for a stochastic morphology of the language in question.]</Paragraph>
    <Paragraph position="8"> For both English and French, correct performance is found in 83% of the words; details are presented in Tables 10 and 11. For English, these figures correspond to precision of 829/(829 + 52 + 83) = 85.9%, and recall of 829/(829 + 52 + 36) = 90.4%.</Paragraph>
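    The two formulas above imply the category counts for English: Good = 829, Wrong Analysis = 52, Failed to Analyze = 36, and Spurious Analysis = 83. A small illustrative sketch of the same arithmetic:

        # Precision/recall arithmetic for the English evaluation (counts read off the
        # formulas in the text: 829 Good, 52 Wrong, 36 Failed to Analyze, 83 Spurious).
        good, wrong, failed, spurious = 829, 52, 36, 83

        precision = good / (good + wrong + spurious)   # share of proposed analyses that are correct
        recall    = good / (good + wrong + failed)     # share of truly complex words analyzed correctly

        print(f"precision = {precision:.3f}")   # 0.860
        print(f"recall    = {recall:.3f}")      # 0.904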
    <Paragraph position="9"> 8. Triage
As noted above, the goal of triage is to determine how many stems must occur in order for the data to be strong enough to support the existence of a linguistically real signature. MDL provides a simple but not altogether satisfactory method of achieving this end.</Paragraph>
    <Paragraph position="10"> Using MDL for this task amounts to determining whether the total description length decreases when a signature is eliminated by taking all of its words and eliminating their morphological structure, and reanalyzing the words as morphologically simple (i.e., as having no morphological structure). This is how we have implemented it, in any event; one could well imagine a variant under which some or all subparts of the signature that themselves constitute other signatures were made part of those other signatures. For example, the signature NULL.ine.ly is motivated by a single stem, just. Under the former triage criterion, justine and justly would be treated as unanalyzed words, whereas under the latter, just and justly would be made members of the (large) NULL.ly signature, and just and justine might additionally be treated as comprising parts of the signature NULL.ine along with bernard, gerald, eng, capitol, elephant, def, and sup (although that would involve permitting a single stem to participate in two distinct signatures).</Paragraph>
    <Paragraph position="11"> Our MDL-based measure tests the goodness of a signature by testing each signature σ to see if the analysis is better when that signature is deleted. This deletion entails treating the signature's words as members of the signature of unanalyzed words (which is the largest signature, and hence such signature pointers are relatively short). Each word member of the signature, however, now becomes a separate stem, with all of the increase in pointer length that this entails, as well as the increase in letter content for the stem component.</Paragraph>
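    A schematic rendering of this per-signature test follows. The cost model here is deliberately simplified (a fixed number of bits per letter and minus-log-frequency pointers); it is meant only to show the shape of the comparison, not to reproduce the full description length of the model, and all names and the bits-per-letter value are assumptions of this sketch.

        from math import log2
        from collections import Counter

        BITS_PER_LETTER = 4.5   # assumed information content of a letter (a tunable choice)

        def ptr(count, total):
            """Bits for a pointer to an item occurring `count` times out of `total` tokens."""
            return -log2(count / total)

        def delta_if_deleted(word_tokens, corpus_tokens, unanalyzed_tokens):
            """word_tokens maps (stem, suffix) -> token count for one signature ('' = NULL).
            Returns cost-after-deletion minus cost-of-keeping; a negative value means
            deleting the signature shortens the description."""
            sig_tokens = sum(word_tokens.values())
            stems = {t for t, _ in word_tokens}
            suffixes = {f for _, f in word_tokens}
            stem_tok, suf_tok = Counter(), Counter()
            for (t, f), n in word_tokens.items():
                stem_tok[t] += n
                suf_tok[f] += n

            # Keeping the signature: spell each stem and suffix once; each word token
            # points to the signature, then to its stem and suffix within it.
            keep = BITS_PER_LETTER * (sum(map(len, stems)) + sum(map(len, suffixes)))
            keep += sum(n * (ptr(sig_tokens, corpus_tokens)
                             + ptr(stem_tok[t], sig_tokens)
                             + ptr(suf_tok[f], sig_tokens))
                        for (t, f), n in word_tokens.items())

            # Deleting it: every word is spelled out in full as a new, unanalyzed stem;
            # each token points into the large (hence cheap) unanalyzed-words signature
            # and then to that now-rare stem.
            new_total = unanalyzed_tokens + sig_tokens
            drop = BITS_PER_LETTER * sum(len(t + f) for t, f in word_tokens)
            drop += sum(n * (ptr(new_total, corpus_tokens) + ptr(n, new_total))
                        for n in word_tokens.values())
            return drop - keep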
    <Paragraph position="12"> One may draw the following conclusions, I believe, from the straightforward application of such a measure. On the whole, the effects are quite good, but in a certain number of cases they are by no means as close to a human's decisions as one would like. In addition, the effects are significantly influenced by two decisions that we have already discussed: (i) the information associated with each letter, and (ii) the decision as to whether to model suffix frequency based solely on signature-internal frequencies, or based on frequency across the entire morphology. The greater the information associated with each letter, the more worthwhile morphology is (because maintaining multiple copies of nearly similar stems becomes increasingly costly and burdensome).</Paragraph>
    <Paragraph position="13"> When suffix frequencies (which are used to compute the compressed length of any analyzed word) are based on the frequency of the suffixes in the entire lexicon, rather than conditionally within the signature in question, the loss of a signature entails a hit on the compression of all other words in the lexicon that employed that suffix; hence triage is less dramatic under that modeling assumption.</Paragraph>
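    To make the contrast concrete: the compressed length of a word includes roughly the negative base-2 logarithm of its suffix's probability, and that probability can be taken either within the word's own signature or across the whole morphology. The counts in the sketch below are invented for illustration (the 51 merely echoes the NULL.ness figure cited in the next paragraph; the totals are made up):

        from math import log2

        def suffix_bits(suffix_count, reference_total):
            """Pointer length for a suffix whose frequency is suffix_count / reference_total."""
            return -log2(suffix_count / reference_total)

        # A suffix such as -ness is obligatory within its own small signature but rare
        # corpus-wide, so the two modeling assumptions price it very differently:
        print(suffix_bits(51, 102))       # signature-internal: 1.0 bit
        print(suffix_bits(51, 200_000))   # across the whole morphology: ~11.9 bits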
    <Paragraph position="14"> Consider the effect of this computation on the signatures produced from a 500,000-word corpus of English. After the modifications discussed to this point, but before triage, there were 603 signatures with two or more stems and two or more suffixes, and there were 1,490 signatures altogether. Application of triage leads to the loss of only 240 signatures. The single-suffix signatures that were eliminated were: ide, it, rs, he, ton, o, and ie, all of which are spurious. However, a number of signatures that should not have been lost were eliminated, most strikingly: NULL.ness, with 51 good analyses, NULL.ful, with 18 good analyses, and NULL.ish with only 8 analyses.</Paragraph>
    <Paragraph position="15"> Most of the cases eliminated, however, were indeed spurious. Counting only those signatures that involve suffixes (rather than compounds) and that were in fact correct, the percentage of the words whose analysis was incorrectly eliminated by triage was 21.9% (236 out of 1,077 changes). Interestingly, in light of the discussion on results above, one of the signatures that was lost was i.us for the Latin plural (based in this particular case on genii~genius). Also eliminated (and this is most regrettable) was NULL.n't (could~had~does~were~would~did).</Paragraph>
    <Paragraph position="16"> Because maximizing correct results is as important as testing the MDL model proposed here, I have also utilized a triage algorithm that departs from the MDL-based optimization in certain cases, which I shall identify in a moment. I believe that when the improvements identified in Section 10 below are made, the purely MDL-based algorithm will be more accurate; that prediction remains to be tested, to be sure. On this account, we discard any signature for which the total number of stem letters is less than five, and any signature consisting of a single, one-letter suffix; we keep, then, only signatures for which the savings in letter counts is greater than 15 (where savings in letter counts is simply the difference between the sum of the lengths of the words spelled out as monomorphemic words and the sum of the lengths of the stems and the suffixes); 15 is chosen empirically.</Paragraph>
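    A sketch of this heuristic filter, under the assumption that a signature is represented as a list of stems and a list of suffix strings ('' for NULL) and that every stem takes every suffix:

        def letter_savings(stems, suffixes):
            """Letters saved by the analysis: the words spelled out monomorphemically,
            minus the stems and suffixes each spelled out once."""
            spelled_out = sum(len(t) + len(f) for t in stems for f in suffixes)
            return spelled_out - (sum(map(len, stems)) + sum(map(len, suffixes)))

        def keep_signature(stems, suffixes, threshold=15):
            if sum(map(len, stems)) < 5:                        # too little stem material
                return False
            if len(suffixes) == 1 and len(suffixes[0]) <= 1:    # a single suffix of at most one letter
                return False
            return letter_savings(stems, suffixes) > threshold  # 15 chosen empirically

        print(keep_signature(["proceed", "exceed"], ["", "ed", "ing", "s"]))   # True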
  </Section>
</Paper>