<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0602"> <Title>Unsupervised Learning of Morphology Using a Novel Directed Search Algorithm: Taking the First Step</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <Paragraph position="0"> The system of Snover and Brent (2001) uses a generative probability model and a hill-climbing search. No quantitative studies had been conducted on it, and the hill-climbing search appears to limit that system's usefulness. We have developed a system based on a novel search and an extension of the previous probability model of Snover and Brent.</Paragraph> <Paragraph position="1"> The use of probabilistic models is equivalent to the use of minimum description length models. Searching for the most probable hypothesis is just as compelling as searching for the smallest hypothesis, and a model formulated in one framework can, through some mathematical manipulation, be reformulated into the other: by taking the negative log of a probability distribution, one finds the number of bits required to encode a value according to that distribution. Our system does not use the minimum description length principle, but it could easily be reformulated to do so.</Paragraph> <Paragraph position="2"> Our goal in designing this system is to detect the final stem and suffix break of each word, given a list of the most common words in a language.</Paragraph> <Paragraph position="3"> We do not distinguish between derivational and inflectional suffixation, or between the notion of a stem and a base. Our probability model differs slightly from that of Snover and Brent (2001), but the main difference is in the search technique. We find and analyze subsets of the lexicon to find good solutions for small sets of words, and we then combine these sub-hypotheses to form a morphological analysis of the entire input lexicon. We do not attempt to learn prefixes, infixes, or other more complex morphological systems, such as template-based morphology: we are attempting to discover the component of many morphological systems that is strictly concatenative.</Paragraph> <Paragraph position="4"> Finally, our model does not currently have a mechanism for dealing with multiple interpretations of a word, or with morphological ambiguity.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Probability Model </SectionTitle> <Paragraph position="0"> This section introduces a prior probability distribution over the space of all hypotheses, where a hypothesis is a set of words, each with a morphological split separating the stem and suffix. The distribution is based on a seven-step model for the generation of a hypothesis, which is heavily based upon the probability model presented in Snover and Brent (2001), with steps 1-3 of the generative procedure being the same. The two models diverge at step 4 with the pairing of stems and suffixes. Whereas the previous model paired individual stems with suffixes, our new model uses the abstract structure of paradigms.</Paragraph> <Paragraph position="1"> A paradigm is a set of suffixes and the stems that attach to those suffixes and no others. Each stem is in exactly one paradigm, and each paradigm has at least one stem. This is an important improvement to the model, as it takes into account the patterns in which stems and suffixes attach.</Paragraph>
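As a concrete illustration (not from the paper; the representation below is a hypothetical one), the paradigms of the running example used in the steps that follow can be written down directly. Each stem appears in exactly one paradigm, and removing the stem-suffix breaks recovers the word list.

# Illustrative only: one possible representation of a hypothesis as
# paradigms, using the running example of Section 2 (walk, look, door,
# far, cat).  The empty suffix is written "".
paradigms = [
    {"stems": {"walk", "look"}, "suffixes": {"", "ed", "s"}},
    {"stems": {"door", "cat"}, "suffixes": {"", "s"}},
    {"stems": {"far"}, "suffixes": {""}},
]

def words_with_breaks(paradigms):
    """Expand every paradigm into its (stem, suffix) pairs."""
    return {(t, f) for p in paradigms
            for t in p["stems"] for f in p["suffixes"]}

# Removing the breaks recovers the input lexicon.
lexicon = sorted(t + f for t, f in words_with_breaks(paradigms))
print(lexicon)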
<Paragraph position="2"> The seven steps are presented below, along with their probability distributions and a running example of how a hypothesis could be generated by this process. By taking the product over the distributions from all of the steps of the generative process, one can calculate the prior probability of any given hypothesis. What is described in this section is a mathematical model, not an algorithm intended to be run.</Paragraph> <Paragraph position="3"> 1. Choose the number of stems, $M$, according to the distribution $P(M) = \frac{6}{\pi^2 M^2}$ (1). The $6/\pi^2$ term normalizes the inverse-squared distribution on the positive integers. The number of suffixes, $X$, is chosen according to the same probability distribution. The symbols M for steMs and X for suffiXes are used throughout this paper. Example: $M = 5$, $X = 3$.</Paragraph> <Paragraph position="4"> 2. For each stem $i$, choose its length in letters, $L^M_i$, according to the inverse-squared distribution. Assuming that the lengths are chosen independently and multiplying together their probabilities, we have $P(L^M_1, \ldots, L^M_M) = \prod_{i=1}^{M} \frac{6}{\pi^2 (L^M_i)^2}$ (2). The distribution for the lengths of the suffixes, $L^X_j$, is similar to (2), differing only in that suffixes of length 0 are allowed, by offsetting the length by one. Example: $L^M = 4, 4, 4, 3, 3$; $L^X = 2, 0, 1$.</Paragraph> <Paragraph position="5"> 3. Let $A$ be the alphabet, and let $P_A$ be a probability distribution on $A$. For each $i$ from 1 to $M$, generate stem $i$ by choosing $L^M_i$ letters at random, according to the probabilities $P_A$. The suffix set SUFF is generated in the same manner. The probability of any character $c$ being chosen is obtained from a maximum likelihood estimate, $P_A(c) = C(c) / \sum_{c' \in A} C(c')$, where $C(c)$ is the count of $c$ among all the hypothesized stems and suffixes and the denominator is the total count. The joint probability of the hypothesized stem and suffix sets is defined by the distribution $P(\mathrm{STEM}, \mathrm{SUFF}) = M! \, X! \prod_{t \in \mathrm{STEM} \cup \mathrm{SUFF}} \prod_{c \in t} P_A(c)$ (3). The factorial terms reflect the fact that the stems and suffixes could be generated in any order. Example: STEM = {walk, look, door, far, cat}, SUFF = {ed, $\emptyset$, s}.</Paragraph> <Paragraph position="6"> 4. We now choose the number of paradigms, $N$, which can range from 1 to $M$, since each stem is in exactly one paradigm and each paradigm has at least one stem. We pick $N$ according to the uniform distribution $P(N \mid M) = 1/M$ (4).</Paragraph> <Paragraph position="7"> 5. We choose the number of suffixes in each paradigm according to a uniform distribution. The distribution for picking $Y_i$, the number of suffixes for paradigm $i$, is $P(Y_i \mid X) = 1/X$ (5).</Paragraph> <Paragraph position="8"> 6. For each paradigm $i$, we choose the set of $Y_i$ suffixes, $\mathrm{PARA}^X_i$, that the paradigm will represent. The number of subsets of a given size is finite, so we can again use the uniform distribution. This implies that the probability of each individual subset of size $Y_i$ is the inverse of the total number of such subsets. Assuming that the choices for each paradigm are independent, $P(\mathrm{PARA}^X_1, \ldots, \mathrm{PARA}^X_N \mid X, Y) = \prod_{i=1}^{N} \binom{X}{Y_i}^{-1}$ (6).</Paragraph> <Paragraph position="9"> 7. For each stem, choose the paradigm that the stem will belong to, according to a distribution that favors paradigms with more stems. The probability of choosing a paradigm $i$ for a stem is calculated using a maximum likelihood estimate, $P(i) = |\mathrm{PARA}^M_i| / M$ (7), where $\mathrm{PARA}^M_i$ is the set of stems assigned to paradigm $i$.</Paragraph>
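Although the model is a mathematical construction rather than an algorithm to be run, the prior it defines is straightforward to evaluate. The sketch below is illustrative only, not the authors' code: the function names are hypothetical and the per-step factors follow the reconstruction of equations (1)-(7) given above. It computes the log prior of a hypothesis represented as a list of paradigms.

import math
from collections import Counter

INV_SQ_NORM = 6.0 / math.pi ** 2   # normalizes 1/n^2 over the positive integers

def inv_squared(n):
    """Inverse-squared distribution used in steps 1 and 2."""
    return INV_SQ_NORM / (n * n)

def log_prior(paradigms):
    """Log prior of a hypothesis given as a list of (stems, suffixes) pairs.
    The empty suffix is written ""."""
    stems = sorted({t for st, _ in paradigms for t in st})
    suffixes = sorted({f for _, sf in paradigms for f in sf})
    M, X, N = len(stems), len(suffixes), len(paradigms)

    lp = math.log(inv_squared(M)) + math.log(inv_squared(X))          # step 1
    lp += sum(math.log(inv_squared(len(t))) for t in stems)           # step 2: stem lengths
    lp += sum(math.log(inv_squared(len(f) + 1)) for f in suffixes)    # step 2: suffix lengths (0 allowed)

    counts = Counter("".join(stems) + "".join(suffixes))              # step 3: character MLE
    total = sum(counts.values())
    lp += math.lgamma(M + 1) + math.lgamma(X + 1)                     # the M! X! terms
    lp += sum(c * math.log(c / total) for c in counts.values())

    lp += math.log(1.0 / M)                                           # step 4: number of paradigms
    for st, sf in paradigms:
        lp += math.log(1.0 / X)                                       # step 5: paradigm size
        lp -= math.log(math.comb(X, len(sf)))                         # step 6: which suffixes
        lp += len(st) * math.log(len(st) / M)                         # step 7: stem assignment (MLE)
    return lp

example = [({"walk", "look"}, {"", "ed", "s"}),
           ({"door", "cat"}, {"", "s"}),
           ({"far"}, {""})]
print(log_prior(example))

For the running example, this multiplies, in log space, exactly the factors that equations (1) through (7) assign to the three paradigms.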
<Paragraph position="10"> Combining the results of stages 6 and 7, one can see that the running example would yield the hypothesis consisting of the set of words with suffix breaks {walk+$\emptyset$, walk+s, walk+ed, look+$\emptyset$, look+s, look+ed, far+$\emptyset$, door+$\emptyset$, door+s, cat+$\emptyset$, cat+s}. Removing the breaks in the words results in the set of input words. To find the probability of this hypothesis, one simply takes the product of the probabilities from equations (1) to (7).</Paragraph> <Paragraph position="11"> The inverse-squared distribution is used in steps 1 and 2 to simulate a relatively uniform probability distribution over the positive integers that slightly favors smaller numbers. Experiments substituting the universal prior for integers, developed by Rissanen (1989), for the inverse-squared distribution have shown that the model is not sensitive to the exact distribution used for these steps. Only slight differences in some of the final hypotheses were found, and it was unclear which of the methods produced superior results. The reason for the lack of effect is that the two distributions are not too dissimilar and that steps 1 and 2 are not the main contributors to the probability mass of our model. Thus, for the sake of computational simplicity, we use the inverse-squared distribution for these steps.</Paragraph> <Paragraph position="12"> Using this generative model, we can assign a probability to any hypothesis. Typically one wishes to know the probability of the hypothesis given the data; however, in our case such a distribution is not required. Equation (8) shows how the probability of the hypothesis given the data could be derived from the prior probability of the hypothesis: $P(\mathrm{Hyp} \mid \mathrm{Data}) = P(\mathrm{Data} \mid \mathrm{Hyp}) \, P(\mathrm{Hyp}) / P(\mathrm{Data})$ (8). Our search only considers hypotheses consistent with the data. The probability of the data given the hypothesis, $P(\mathrm{Data} \mid \mathrm{Hyp})$, is always 1, since removing the breaks from any consistent hypothesis produces the input data; this would not be the case if our search considered inconsistent hypotheses. The prior probability of the data is unknown, but it is constant over all hypotheses, so the probability of the hypothesis given the data reduces to $P(\mathrm{Hyp}) / P(\mathrm{Data})$. The prior probability of the hypothesis is given by the above generative process and, among all consistent hypotheses, the one with the greatest prior probability also has the greatest posterior probability.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Search </SectionTitle> <Paragraph position="0"> This section details a novel search algorithm that is used to find the most likely segmentation of all the words in the input lexicon, $\mathcal{L}$. The input lexicon is a list of words extracted from a corpus. The output of the search is a segmentation of each of the input words into a stem and suffix. The algorithm does not directly attempt to find the most probable hypothesis consistent with the input, but it finds a highly probable consistent hypothesis.</Paragraph> <Paragraph position="1"> The directed search is accomplished in two steps. First, sub-hypotheses, each of which is a hypothesis about a subset of the lexicon, are examined and ranked. The $K$ best sub-hypotheses are then incrementally combined until a single sub-hypothesis remains. The remainder of the input lexicon is added to this sub-hypothesis, at which point it becomes the final hypothesis.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Ranking Sub-Hypotheses </SectionTitle> <Paragraph position="0"> We define the set of possible suffixes to be the set of terminal substrings, including the empty string $\emptyset$, of the words in $\mathcal{L}$. Each subset of the possible suffixes has a corresponding sub-hypothesis. The sub-hypothesis $h$ corresponding to a set of suffixes $\mathrm{SUFF}_h$ has the set of stems $\mathrm{STEM}_h$. For each stem $t$ and suffix $f$ in $h$, the concatenation $t+f$ must be a word in the input lexicon, and $\mathrm{STEM}_h$ is the maximal set of stems that meets this requirement.</Paragraph> <Paragraph position="1"> The sub-hypothesis $h$ is thus the hypothesis over the set of words formed by all pairings of the stems in $\mathrm{STEM}_h$ and the suffixes in $\mathrm{SUFF}_h$, with the corresponding morphological breaks. One can think of each sub-hypothesis as initially corresponding to a maximally filled paradigm. We only consider sub-hypotheses that have at least two stems and two suffixes.</Paragraph> <Paragraph position="2"> For each sub-hypothesis $h$ there is a corresponding counter hypothesis, $\bar{h}$, which has the same set of words as $h$, but in which every word is hypothesized to consist of the whole word as the stem and $\emptyset$ as the suffix.</Paragraph> <Paragraph position="3"> We can now assign a score to each sub-hypothesis as follows: $\mathrm{score}(h) = P(h) / P(\bar{h})$. This reflects how much more probable $h$ is for those words than the counter, or null, hypothesis.</Paragraph> <Paragraph position="4"> The number of possible sub-hypotheses grows considerably as the number of words increases, so examining all possible sub-hypotheses becomes unreasonable at very large lexicon sizes. However, since we are only concerned with finding the $K$ best sub-hypotheses, we do not actually need to examine every sub-hypothesis. A variety of search algorithms can be used to find high-scoring sub-hypotheses without significant risk of missing any of the $K$ best.</Paragraph> <Paragraph position="5"> One can view all sub-hypotheses as nodes in a directed graph. A node $n_i$ is connected to a node $n_j$ if and only if $n_j$ represents a superset of the suffixes that $n_i$ represents, exactly one suffix larger than the set that $n_i$ represents. Beginning at the node representing no suffixes, one can apply standard graph search techniques, such as beam search or best-first search, to find the $K$ best scoring nodes without visiting all nodes. While one cannot guarantee that such approaches perform exactly the same as examining all sub-hypotheses, initial experiments using a beam search with a beam size equal to $K$, with $K$ set to 100, show that the $K$ best sub-hypotheses are found with a significant decrease in the number of nodes visited. The experiments presented in this paper do not use these pruning methods.</Paragraph> </Section>
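The following Python sketch illustrates the sub-hypothesis construction just described. It is not the authors' code: the toy lexicon, the helper names, and the stand-in score are all illustrative. The paper scores each sub-hypothesis as $P(h)/P(\bar{h})$ under the Section 2 model; here a simple coverage count stands in for that ratio.

from itertools import combinations

def stems_for(suffix_set, lexicon):
    """STEM_h: the maximal set of stems t such that t+f is in the
    lexicon for every suffix f in suffix_set ("" is the empty suffix)."""
    candidates = {w[:len(w) - len(f)] for w in lexicon for f in suffix_set
                  if w.endswith(f)}
    return {t for t in candidates if t and all(t + f in lexicon for f in suffix_set)}

def sub_hypotheses(lexicon, max_suffixes=3):
    """Yield (suffix set, stem set) pairs with at least two stems and
    two suffixes, for suffix sets up to a small size."""
    possible = {w[i:] for w in lexicon for i in range(len(w) + 1)}  # terminal substrings, incl. ""
    for k in range(2, max_suffixes + 1):
        for suffix_set in combinations(sorted(possible), k):
            stems = stems_for(set(suffix_set), lexicon)
            if len(stems) >= 2:
                yield set(suffix_set), stems

lexicon = {"walk", "walks", "walked", "look", "looks", "looked",
           "door", "doors", "cat", "cats", "far"}
# Stand-in score: number of words the sub-hypothesis covers.
ranked = sorted(sub_hypotheses(lexicon),
                key=lambda h: len(h[0]) * len(h[1]), reverse=True)
for suffixes, stems in ranked[:3]:
    print(sorted(suffixes), "->", sorted(stems))

On this toy lexicon the top-ranked suffix sets are the expected ones, such as {$\emptyset$, s} with stems {walk, look, door, cat} and {$\emptyset$, ed, s} with stems {walk, look}.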
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Combining Sub-Hypotheses </SectionTitle> <Paragraph position="0"> The $K$ highest-scoring sub-hypotheses are incrementally combined in order to create a hypothesis over the complete set of input words. The choice of $K$ should not vary from language to language; it is simply a way of limiting the computational complexity of the algorithm. Changing the value of $K$ does not dramatically alter the results of the algorithm, though higher values of $K$ give slightly better results. We let $K$ be 100 in the experiments reported here.</Paragraph> <Paragraph position="1"> Let $H$ be the set of the $K$ highest-scoring sub-hypotheses. We remove from $H$ the sub-hypothesis $h^*$ with the highest score. The words in $h^*$ are then added to each of the remaining sub-hypotheses in $H$ and to their counter hypotheses. Every sub-hypothesis $h$ and its counter $\bar{h}$ are modified so that they now contain all the words from $h^*$, with the morphological breaks those words had in $h^*$. If a word that was already in $h$ and $\bar{h}$ is also in $h^*$, it now takes the morphological break from $h^*$, overriding whatever break was previously attributed to it.</Paragraph> <Paragraph position="2"> All of the sub-hypotheses now need to be rescored, since the words in them will likely have changed. If, after rescoring, none of the sub-hypotheses has a score greater than one, then we use $h^*$ as our final hypothesis. Otherwise we repeat the process of selecting $h^*$ and adding it in, continuing until all sub-hypotheses have scores of one or less or no sub-hypotheses remain. The final sub-hypothesis $h^*$ is then converted into a full hypothesis over all the words: every word in $\mathcal{L}$ that is not in $h^*$ is added to $h^*$ with $\emptyset$ as its suffix. This results in a hypothesis over all the words in $\mathcal{L}$.</Paragraph> </Section> </Section> </Paper>