<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0107"> <Title>Unsupervised Induction of Natural Language Morphology Inflection Classes</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Inflection Classes as Motivation </SectionTitle> <Paragraph position="0"> When learning the morphology of a foreign language, it is common for a student to study tables of inflection classes. In Spanish, for example, a regular verb belongs to one of three inflection classes--verbs that take the -ar infinitive suffix inflect for various syntactic features with one set of suffixes, verbs that take the -er infinitive suffix realize the same set of syntactic features with a second set of suffixes, while -ir verbs take yet a third set.</Paragraph> <Paragraph position="1"> Carstairs-McCarthy formalizes the concept of an inflection class in chapter 16 of The Handbook of Morphology (1998). In his terminology, a language with inflectional morphology contains lexemes which occur in a variety of word forms.</Paragraph> <Paragraph position="2"> Each word form carries two pieces of information: 1) Lexical content and 2) Morphosyntactic properties.</Paragraph> <Paragraph position="3"> For example, the English word form gave expresses the lexeme GIVE plus the morphosyntactic property Past, while gives expresses GIVE plus the properties 3rd Person, Singular, and Non-Past. A set of morphosyntactic properties realized with a single word form is defined to be a cell, while a paradigm is a set of cells exactly expressed by the word forms of some lexeme.</Paragraph> <Paragraph position="4"> A particular natural language may have many paradigms. 
In English, a language with very little inflectional morphology, there are at least two paradigms: a noun paradigm consisting of two cells, Singular and Plural, and a paradigm for verbs, consisting of the five cells given (with one choice of naming convention) as the first column of Table 1.</Paragraph> <Paragraph position="5"> Lexemes that belong to the same paradigm may still differ in their morphophonemic realizations of various cells in that paradigm--each paradigm may have several associated inflection classes, each of which specifies, for the lexemes belonging to it, the surface instantiation for each cell of the paradigm.</Paragraph> <Paragraph position="6"> Three of the inflection classes within the English verb paradigm are found in Table 1 under the columns A, B, and C. Each column consists of entries corresponding to the cells of the verb paradigm. Each entry contains an informal notation for the morphophonemic process which the inflection class applies to the basic form of a lexeme and examples of word forms filling the corresponding paradigm cell.</Paragraph> <Paragraph position="7"> Inflection class A is one of the largest and most productive verb inflection classes in English, inflection class B contains the Perfective/Passive suffix -/n/, and C is a small &quot;irregular&quot; inflection class of strong verbs.</Paragraph> <Paragraph position="8"> The task our morphology induction system addresses is exactly the discovery of the inflection classes of a natural language. Unlike the analysis in Table 1, however, the rest of this paper treats word forms as simply strings of characters as opposed to strings of phonemes.</Paragraph> </Section> <Section position="5" start_page="0" end_page="404" type="metho"> <SectionTitle> 4 Empirical Inflection Classes </SectionTitle> <Paragraph position="0"> There are two stages in our approach to unsupervised morphology induction. 
First, we define a search space over a set of candidate inflection classes, and second, we search this space for those candidates most likely to be part of a true inflection class in the language. In both stages of our approach we intentionally exploit the fact that suffixes belonging to the same natural language inflection class frequently occur interchangeably on the same stems.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Candidate Inflection Class Search Space </SectionTitle> <Paragraph position="0"> To define a search space wherein we hope to identify inflection classes of a natural language, our algorithm accepts as input a monolingual corpus for that language and proposes candidate morpheme boundaries at every character boundary in every word form in the corpus vocabulary. We call each string before a candidate morpheme boundary a candidate stem or c-stem, and each string after a boundary a c-suffix. We define a candidate inflection class (CIC) to be a set of c-suffixes for which there exists at least one c-stem, t, such that each c-suffix in the CIC concatenated to t produces a word form in the vocabulary. For convenience, let the set of c-stems which generate a CIC, C, be called the adherent c-stems of C; let the number of adherent c-stems of C be C's adherent size; and let the size of the set of c-suffixes in C be the level of C. We denote a CIC in this paper by a period-delimited sequence of c-suffixes.</Paragraph> <Paragraph position="1"> While CIC's effectively model suffix substitution on bound stems, we would also like to model suffix concatenation onto free stems. To this end, the set of candidate morpheme boundaries our algorithm proposes includes those boundaries after the final character in each word form. In this paper we assume an empty suffix, which we denote as O, follows all word form final boundaries. 
A CIC contains the O c-suffix when each adherent c-stem of the CIC can occur, not only bound to other c-suffixes in the CIC, but also as a free stem. For generality, the boundary before the first character of each word form is also a candidate morpheme boundary.</Paragraph> <Paragraph position="2"> Table 2 illustrates the type of CIC's produced by our algorithm. The CIC's in this table, arranged in a systematic but arbitrary order, are each derived from one or more forms in a small vocabulary consisting of a subset of the word forms found under inflection class A from Table 1. Each entry is specified as a period-delimited sequence of c-suffixes in bold above a period-delimited sequence of adherent c-stems in italics. Proposing, as our procedure does, morpheme boundaries at every character boundary in every word form necessarily produces many ridiculous CIC's, such as ame.ames.amed, from the forms blame, blames, and blamed and the c-stem bl. Dispersed among the incorrect CIC's generated by our algorithm, however, are also CIC's that seem very reasonable, such as O.s, from the c-stems blame and solve.</Paragraph> <Paragraph position="3"> Note that where Table 1 lists all the surface forms of the three lexemes BLAME, ROAM, and SOLVE, the vocabulary of Table 2 mimics the vocabulary of a text corpus from a highly inflected language where we expect few, if any, lexemes to occur in the complete set of possible surface forms.</Paragraph> <Paragraph position="4"> Specifically, the vocabulary of Table 2 lacks the surface form blaming of the lexeme BLAME, solved of the lexeme SOLVE, and the root form roam of the lexeme ROAM. Hence, while the reasonable CIC O.s arises from the pairs of surface forms (blame, blames) and (solve, solves), there is no way for the form roams to contribute to the O.s CIC because the surface form roam is missing from this vocabulary. In other words, we lack evidence for an O suffix on the c-stem roam. 
Also notice that, as a result of English spelling rules, the CIC s.ed generated from the pair of surface forms (roams, roamed) is separate from each of the CIC's s.d and es.ed generated from the pair of surface forms (blames, blamed).</Paragraph> <Paragraph position="5"> Looking at Table 2, it is clear there is structure among the CIC's. In particular, at least two types of relations hold between CIC's. First, hierarchically, the c-suffixes of one CIC may be a superset of the c-suffixes of another CIC. For example, the c-suffixes in the CIC e.es.ed are a superset of the c-suffixes in the CIC e.ed. Second, cutting across this hierarchical structure, there is structure between CIC's which propose different morpheme boundaries within the same word forms. Compare the CIC's me.mes.med and e.es.ed; each is derived from exactly the triple of word forms blame, blames, and blamed, but they differ in the placement of the hypothesized morpheme boundary.</Paragraph> <Paragraph position="6"> Taken together, the hierarchical c-suffix set inclusion relations and the morpheme boundary relations impose a lattice structure on the space of CIC's. Figure 1 diagrams the CIC lattice over an interesting subset of the columns of Table 2. Hierarchical links, represented by solid lines, often connect a given CIC to more than one parent and more than one child. The empty CIC (not pictured in Figure 1) can be considered the child of all level one CIC's (including the O CIC), but there is no universal parent of all top level CIC's. Moving up the lattice always results in a monotonic decrease in adherent size because a parent CIC requires each adherent c-stem to form a word with a superset of the c-suffixes of each child.</Paragraph> <Paragraph position="7"> Horizontal morpheme boundary links, represented by dashed lines, connect a CIC, C, with a neighbor to the right if each c-suffix in C begins with the same character. This entails that there is at most one morpheme boundary link leading to the right of each CIC. 
There may, however, be as many links leading to the left as there are characters in the orthography. The only CIC with multiple left links depicted in Figure 1 is O, which has left links to the CIC's e, s, and d. A number of left links emanating from the CIC's in Figure 1 are not shown; among others absent from the figure is the left link from the CIC e.es leading to the CIC ve.ves with the adherent sol. Since left links effectively divide a CIC into separate CIC's, one for each character in the orthography, adherent count monotonically decreases as left links are followed.</Paragraph> <Paragraph position="8"> To help visualize what a CIC lattice looks like when derived from real data, Figure 2 shows a portion of a hierarchical lattice automatically generated from our Spanish newswire corpus. Each entry in Figure 2 contains the c-suffixes comprising the CIC, the adherent size of the CIC, and a sample of adherent c-stems. The lattice in Figure 2 covers: 1) the productive Spanish inflection class for adjectives, a.as.o.os, covering the four adjective paradigm cells: feminine singular, feminine plural, masculine singular, and masculine plural, respectively; 2) all possible CIC subsets of the adjective CIC, e.g. a.as.o, a.os, etc.; and 3) the imposter CIC a.as.o.os.tro, together with its rogue descendents, a.tro and tro.</Paragraph> <Paragraph position="9"> Other CIC's that are descendents of a.as.o.os.tro and that contain the c-suffix tro do not supply additional adherents and hence are not present either in Figure 2 or in our program's representation of the CIC lattice. The CIC's a.as.tro and os.tro, for example, both have only the one adherent, cas, already possessed by their common ancestor a.as.o.os.tro. 
Strictly speaking we have simplified for exposition, as the CIC a.as.o.os.tro is not actually present in the algorithm's representation either, because the c-stem cas occurred with a number of additional c-suffixes yielding the CIC: a.as.i.o.os.sandra.tanier.ter.tro.trol.</Paragraph> </Section> <Section position="2" start_page="0" end_page="404" type="sub_section"> <SectionTitle> 4.2 Search </SectionTitle> <Paragraph position="0"> Given the framework of CIC lattices, the key task for automatic morphology induction is to autonomously separate the nonsense CIC's from the useful ones, thus identifying linguistically plausible inflection classes. This section treats the CIC lattices as a hypothesis space of valid inflection classes and searches this space for CIC's most likely to be true inflection classes in a language.</Paragraph> <Paragraph position="1"> There are many possible search strategies and heuristics applicable to the CIC lattice, and while for future work we intend to explore a variety of search techniques, this paper presents a reasonable and intuitive baseline search procedure. We have investigated a series of algorithms which build upon each other. Each algorithm employs a number of parameters which are tuned by hand. These parameters are only interesting in so far as they help us find true CIC's from among the many in the lattice. The performance of each algorithm is described in section 6.</Paragraph> <Paragraph position="2"> To motivate the general approach we have taken, compare the adherent sizes of the various CIC's in Figure 2. The target CIC a.as.o.os, corresponding to the Spanish adjective inflection class, has 43 adherents. Its various descendents must occur with monotonically increasing adherent sizes, but frequently a child will not more than double or triple its immediate parent's adherent size, and never is there a difference greater than a factor of ten. 
Notice also the large adherent counts of the level one descendents of a.as.o.os; the smallest is as, with 404 adherents.</Paragraph> <Paragraph position="3"> Contrast this behavior with that of CIC's involving the spurious suffix tro. The CIC a.as.o.os.tro occurs in the corpus with exactly one adherent, cas. Additionally, the word forms cena (supper) and centro (center) occur, yielding the CIC a.tro with two adherents. In total, tro is the final string of only 16 individual word forms.</Paragraph> <Paragraph position="4"> In general, we expect that true suffixes in a language will both occur frequently and occur attached to a large number of stems which also accept other suffixes from the same inflection class. These considerations led us to propose three parameters for our basic search strategy: L1 SIZE, a level one adherent size cutoff; TOP SIZE, an absolute adherent size cutoff; and RATIO, a parent-to-child adherent size ratio cutoff. The L1 SIZE parameter requires a c-suffix to be frequent, while the TOP SIZE and RATIO parameters require a suffix to be substitutable for other c-suffixes in a reasonable number of c-stems.</Paragraph> <Paragraph position="5"> We apply these three parameters by beginning our search at the bottom of the lattice. Each level one CIC with an adherent count larger than L1 SIZE is placed in a list of path CIC's. Then for each path CIC, C, we remove C from the list of path CIC's, and in turn consider each of C's hierarchical parents, Pi. If Pi's adherent size is at least TOP SIZE, and if the ratio of Pi's adherent size to C's adherent size is larger than RATIO, then Pi is placed in the list of path CIC's. If no parent of C can be placed in the list of path CIC's, and if C's level is greater than one, then C is placed in a list of selected CIC's. When there are no more CIC's in the list of path CIC's, the search ends, and the CIC's in the selected list are those the algorithm believes are true inflection classes of the language. 
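The search procedure just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than our implementation: the lattice is assumed to be a mapping from each CIC to its adherent size, the hierarchical parents of a CIC are taken to be its minimal proper supersets present in the lattice, and the names cic, parents, vertical_search, and TOY, as well as every adherent count in TOY, are invented for illustration (they are not the counts of Figure 2).

```python
from typing import Dict, FrozenSet, List, Set

CIC = FrozenSet[str]


def cic(text: str) -> CIC:
    """Build a CIC from its period-delimited notation, e.g. 'a.as.o.os'."""
    return frozenset(text.split("."))


def parents(c: CIC, lattice: Dict[CIC, int]) -> List[CIC]:
    """Hierarchical parents of c: minimal proper supersets found in the lattice.

    Assumption: a parent may sit more than one level above c when the
    intermediate CIC's are absent from the lattice.
    """
    supers = [s for s in lattice if s != c and c.issubset(s)]
    return [s for s in supers
            if not any(t != s and t.issubset(s) for t in supers)]


def vertical_search(lattice: Dict[CIC, int], l1_size: int,
                    top_size: int, ratio: float) -> Set[CIC]:
    """Bottom-up search using the L1 SIZE, TOP SIZE and RATIO cutoffs."""
    # Seed the path list with every sufficiently frequent level one CIC.
    path = [c for c in lattice if len(c) == 1 and lattice[c] > l1_size]
    seen = set(path)          # avoids re-processing a CIC (our addition)
    selected: Set[CIC] = set()
    while path:
        c = path.pop()
        advanced = False
        for p in parents(c, lattice):
            if lattice[p] >= top_size and lattice[p] / lattice[c] > ratio:
                advanced = True
                if p not in seen:
                    seen.add(p)
                    path.append(p)
        # A CIC with no acceptable parent is selected unless it is level one.
        if not advanced and len(c) > 1:
            selected.add(c)
    return selected


# Hypothetical adherent counts, invented for this sketch only.
TOY = {cic(k): v for k, v in {
    "a": 1000, "as": 400, "o": 900, "os": 500, "tro": 16,
    "a.as": 160, "a.o": 170, "a.os": 110,
    "as.o": 150, "as.os": 140, "o.os": 180, "a.tro": 2,
    "a.as.o": 70, "a.as.os": 60, "a.o.os": 65, "as.o.os": 62,
    "a.as.o.os": 43, "a.as.o.os.tro": 1,
}.items()}

result = vertical_search(TOY, l1_size=100, top_size=2, ratio=0.1)
print(sorted(".".join(sorted(c)) for c in result))  # ['a.as.o.os']
```

On this toy lattice the search climbs from the four frequent level one CIC's to a.as.o.os and stops there, because the only remaining parent has a single adherent and so fails the TOP SIZE cutoff. 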
As an illustration, suppose we explored the lattice in Figure 2 with the parameter settings L1 SIZE = 100, TOP SIZE = 2, and RATIO = 0.1. Our search algorithm begins by comparing the adherent size of each level one CIC to L1 SIZE.</Paragraph> <Paragraph position="6"> The only level one CIC with an adherent count less than 100 is tro, with 16 adherents, which prevents tro from being placed in the list of path CIC's.</Paragraph> <Paragraph position="7"> Each of the surviving level one CIC's is then considered in turn. The algorithm comes to the CIC a, where the ratios of adherent sizes between each of its parents a.tro, a.as, a.o, and a.os and itself are 0.002, 0.161, 0.173, and 0.108 respectively. Each of these ratios, except that between a and a.tro, at 0.002, is larger than 0.1. And since the adherent sizes of a.as, a.o, and a.os are each larger than TOP SIZE, these three CIC's are placed in the list of path CIC's.</Paragraph> <Paragraph position="8"> From this point, every hierarchical link in Figure 2 leading to the CIC a.as.o.os passes the TOP SIZE and RATIO cutoffs. Thus the algorithm reaches a state where the only CIC in the list of path CIC's is a.as.o.os. When this good CIC is removed from the list of path CIC's, the algorithm finds that its only parent is a.as.o.os.tro with its lone adherent. Since TOP SIZE requires a parent to have at least two adherents, a.as.o.os.tro cannot be placed in the list of path CIC's. As no parent can be placed in the list of path CIC's, a.as.o.os is placed in the list of selected CIC's, which is the desired outcome. The list of path CIC's is now empty and the search ends.</Paragraph> <Paragraph position="9"> To improve performance over the Vertical Only algorithm described above, we next incorporated knowledge from the horizontal morpheme boundary links. Monson (2004) describes how morpheme boundary links in a CIC lattice can be thought of as branchings in a vocabulary trie where identical subtries are conflated. 
Harris (1955) discusses how the branching count in a suffix trie can be exploited to identify morpheme boundaries. We extend the spirit of Harris' work in our algorithm through the use of two search parameters: HORIZ RATIO, a cutoff over $\max_{c} \frac{\#\{\text{adherents ending in character } c\}}{\text{adherent size}}$; and HORIZ SIZE, an adherent size cutoff.</Paragraph> </Section> <Section position="3" start_page="404" end_page="404" type="sub_section"> <SectionTitle> Left Blocking </SectionTitle> <Paragraph position="0"> In the first variant of horizontal blocking we apply these two horizontal parameters when considering a CIC, C, removed from the list of path CIC's. If the adherent size of C is larger than HORIZ SIZE and the maximum fraction of adherents of C that end in any one character is larger than HORIZ RATIO, then C is simply thrown out.</Paragraph> <Paragraph position="1"> For example, suppose we used a HORIZ RATIO of 0.5 and a HORIZ SIZE smaller than 62. The CIC da.do in our Spanish corpus has 62 adherents, 46, or a fraction of 0.742, of which end in the character a (ada and ado fill the feminine and masculine past participle cells for the -ar verb inflection class). If our Left Blocking search algorithm reached the CIC da.do, it would be discarded because its adherent size is larger than HORIZ SIZE and more than half of its adherents end with the same character. Notice that this algorithm does not explicitly follow leftward morpheme boundary links. The rationale for this behavior is that ada.ado will likely be explored independently by a separate vertical path. 
In future experiments we intend to investigate the effect of ensuring that the CIC to the left is explored by overtly placing the leftward CIC in the list of path CIC's.</Paragraph> </Section> <Section position="4" start_page="404" end_page="404" type="sub_section"> <SectionTitle> Right Blocking </SectionTitle> <Paragraph position="0"> So far we have only described an algorithm to block paths where the correct morpheme boundary is to the left of the current hypothesis. There are also CIC's where a morpheme boundary should be moved to the right. The CIC cada.cado, with seven adherents, is one such case.</Paragraph> <Paragraph position="1"> Accordingly, whenever we encounter a CIC, C, all of whose c-suffixes begin with the same character (e.g. c in cada.cado), our algorithm poses the question: if we were considering the CIC to the right (e.g. ada.ado), would we have triggered Left Blocking? If Left Blocking would not have been triggered, then we throw C out. In other words, we prefer the rightmost possible morpheme boundary, unless there is some reason to believe the morpheme boundary should be to the left.</Paragraph> <Paragraph position="2"> Consider cada.cado more closely: the CIC to its right, ada.ado, has 46 adherents, and the character that ends the most of them is c, with 7, or a fraction of 0.152. If we were using a HORIZ RATIO of 0.5 as in the previous section, Left Blocking would not be triggered from ada.ado and so Right Blocking is triggered, throwing out cada.cado. 
On the other hand, if we were considering blocking ada.ado, where both c-suffixes begin with a, the HORIZ RATIO parameter would need to be larger than 0.742, the fraction of da.do's adherents that end in a, before Right Blocking would throw out ada.ado.</Paragraph> </Section> <Section position="5" start_page="404" end_page="404" type="sub_section"> <SectionTitle> Right Blocking Recursive </SectionTitle> <Paragraph position="0"> In addition to standard Right Blocking we explored recursively looking at the next neighbor further to the right of a CIC if the immediate right neighbor falls below the HORIZ SIZE threshold. The rationale behind this variant stems from CIC's such as icada.icado with 4 adherents: crit, publ, ratif, and ub. Since icada.icado's immediate right neighbor cada.cado has only 7 adherents itself, we may not want to base our blocking decision on so little data.</Paragraph> <Paragraph position="1"> Instead we consider the CIC ada.ado, discussed in the previous section, which has a large enough adherent size that we might feel confident in our judgment.</Paragraph> </Section> <Section position="6" start_page="404" end_page="404" type="sub_section"> <SectionTitle> Full Horizontal Blocking </SectionTitle> <Paragraph position="0"> The final version of the search we tried combines Left Blocking and Right Blocking Recursive while constraining both to use the same values for the HORIZ RATIO and HORIZ SIZE parameters.</Paragraph> </Section> </Section> <Section position="6" start_page="404" end_page="404" type="metho"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> To evaluate the performance of the various baseline search strategies, we first decided on a standard set of six inflection classes for Spanish: two for nouns, O.s and O.es; one for adjectives, a.as.o.os; and three for verbs, corresponding to the traditional -ar, -er, and -ir verb inflection classes.</Paragraph> <Paragraph position="1"> We call these six inflection classes our set of standard IC's. 
We make no claim as to the truth or completeness of the set of standard inflection classes we used in this evaluation. The standard IC's we compiled were simply some of the most common suffixes filling some of the most common morphosyntactic properties marked in Spanish.</Paragraph> <Paragraph position="2"> We then defined measures of recall, precision, and fragmentation over these standard IC's (Figure 3). As defined, recall measures the fraction of unique suffixes in the standard IC's that are found within those selected CIC's that are subsets of some inflection class in the standard; precision measures the fraction of unique suffixes among all the selected CIC's that are found within those selected CIC's that are subsets of an inflection class in the standard; and fragmentation measures redundancy, specifically calculating the ratio of the number of selected CIC's that are subsets of standard IC's to the number of inflection classes in the standard. High values for recall and precision are desirable, while a fragmentation of exactly 1 implies that the number of usefully selected CIC's is the same as the number of inflection classes in the standard.</Paragraph> </Section> </Paper>