<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1036">
  <Title>Unsupervised Segmentation of Words Using Prior Distributions of Morph Length and Frequency</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Probabilistic generative model
</SectionTitle>
    <Paragraph position="0"> In this section we derive the new model. We follow a step-by-step process, during which a morph lexicon and a corpus are generated. The morphs in the lexicon are strings that emerge as a result of a stochastic process. The corpus is formed through another stochastic process that picks morphs from the lexicon and places them in a sequence. At two points of the process, prior knowledge is required in the form of two real numbers: the most common morph length and the proportion of hapax legomena morphs.</Paragraph>
    <Paragraph position="1"> The model can be used for segmentation of words by requiring that the corpus created is exactly the input data. By selecting the most probable morph lexicon that can produce the input data, we obtain a segmentation of the words in the corpus, since we can rewrite every word as a sequence of morphs.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Size of the morph lexicon
</SectionTitle>
      <Paragraph position="0"> We start the generation process by deciding the number of morphs in the morph lexicon (type count).</Paragraph>
      <Paragraph position="1"> This number is denoted by nu and its probability p(nu) follows the uniform distribution. This means that, a priori, no lexicon size is more probable than another.3 2We use standard terminology: Morph types are the set of different, distinct morphs. By contrast, morph tokens are the instances (or occurrences) of morphs in the corpus.</Paragraph>
      <Paragraph position="2"> 3This is an improper prior, but it is of little practical significance for two reasons: (i) This stage of the generation process</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Morph lengths
</SectionTitle>
      <Paragraph position="0"> For each morph in the lexicon, we independently choose its length in characters according to the</Paragraph>
      <Paragraph position="2"> where lui is the length in characters of the ith morph, and a and b are constants. G(a) is the gamma function: null</Paragraph>
      <Paragraph position="4"> The maximum value of the density occurs at lui = (a [?] 1)b, which corresponds to the most common morph length in the lexicon. When b is set to one, and a to one plus our prior belief of the most common morph length, the pdf (probability density function) is completely defined.</Paragraph>
      <Paragraph position="5"> We have chosen the gamma distribution for morph lengths, because it corresponds rather well to the real length distribution observed for word types in Finnish and English corpora that we have studied. The distribution also fits the length distribution of the morpheme labels used as a reference (cf. Section 3). A Poisson distribution can be justified and has been used in order to model the length distribution of word and morph tokens [e.g., (Creutz and Lagus, 2002)], but for morph types we have chosen the gamma distribution, which has a thicker tail.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Morph strings
</SectionTitle>
      <Paragraph position="0"> For each morph ui, we decide the character string it consists of: We independently choose lui characters at random from the alphabet in use. The probability of each character cj is the maximum likelihood estimate of the occurrence of this character in the</Paragraph>
      <Paragraph position="2"> where ncj is the number of occurrences of the character cj in the corpus, and summationtextk nck is the total number of characters in the corpus.</Paragraph>
      <Paragraph position="3"> only contributes with one probability value, which will have a negligible effect on the model as a whole. (ii) A proper probability density function would presumably be very flat, which would hardly help guiding the search towards an optimal model.</Paragraph>
      <Paragraph position="4"> 4Alternatively, the maximum likelihood estimate of the occurrence of the character in the lexicon could be used.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Morph order in the lexicon
</SectionTitle>
      <Paragraph position="0"> The lexicon consists of a set of nu morphs and it makes no difference in which order these morphs have emerged. Regardless of their initial order, the morphs can be sorted into a uniquely defined (e.g., alphabetical) order. Since there are nu! ways to order nu different elements,5 we multiply the probability accumulated so far by nu!:</Paragraph>
      <Paragraph position="2"/>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.5 Morph frequencies
</SectionTitle>
      <Paragraph position="0"> The next step is to generate a corpus using the morph lexicon obtained in the previous steps. First, we independently choose the number of times each morph occurs in the corpus. We pursue the following line of thought: Zipf has studied the relationship between the frequency of a word, f, and its rank, z.6 He suggests that the frequency of a word is inversely proportional to its rank. Mandelbrot has refined Zipf's formula, and suggests a more general relationship [see, e.g., (Baayen, 2001)]:</Paragraph>
      <Paragraph position="2"> where C,a and b are parameters of a text.</Paragraph>
      <Paragraph position="3"> Let us derive a probability distribution from Mandelbrot's formula. The rank of a word as a function of its frequency can be obtained by solving for z from (5):</Paragraph>
      <Paragraph position="5"> Suppose that one wants to know the number of words that have a frequency close to f rather than the rank of the word with frequency f. In order to obtain this information, we choose an arbitrary interval around f: [(1/g)f ...gf[, where g &gt; 1, and compute the rank at the endpoints of the interval.</Paragraph>
      <Paragraph position="6"> The difference is an estimate of the number of words 5Strictly speaking, our probabilistic model is not perfect, since we do not make sure that no morph can appear more than once in the lexicon.</Paragraph>
      <Paragraph position="7"> 6The rank of a word is the position of the word in a list, where the words have been sorted according to falling frequency. null that fall within the interval, i.e., have a frequency close to f:</Paragraph>
      <Paragraph position="9"> This can be transformed into an exponential pdf by (i) binning the frequency axis so that there are no overlapping intervals. (This means that the frequency axis is divided into non-overlapping intervals [(1/g) ^f ...g ^f[, which is equivalent to having ^f values that are powers of g2: ^f0 = g0 = 1, ^f1 = g2, ^f2 = g4,... All frequencies f are rounded to the closest ^f.) Next (ii), we normalize the number of words with a frequency close to ^f with the total number of words summationtext^f n^f. Furthermore (iii), ^f is written as elog ^f, and (iv) C must be chosen so that the normalization coefficient equals 1/a, which yields a proper pdf that integrates to one. Note also the factor logg2. Like ^f, log ^f is a discrete variable.</Paragraph>
      <Paragraph position="10"> We approximate the integral of the density function around each value log ^f by multiplying with the difference between two successive log ^f values, which  Now, if we assume that Zipf's and Madelbrot's formulae apply to morphs as well as to words, we can use formula (8) for every morph frequency fui, which is the number of occurrences (or frequency) of the morph ui in the corpus (token count). However, values for a and g2 must be chosen. We set g2 to 1.59, which is the lowest value for which no empty frequency bins will appear.7 For fui = 1, (8) reduces to logg2/a. We set this value equal to our prior belief of the proportion of morph types that are to occur only once in the corpus (hapax legomena).</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.6 Corpus
</SectionTitle>
      <Paragraph position="0"> The morphs and their frequencies have been set. The order of the morphs in the corpus remains to be decided. The probability of one particular order is the inverse of the multinomial: 7Empty bins can appear for small values of fu i due to fui's being rounded to the closest ^fui, which is a power of g2.</Paragraph>
      <Paragraph position="2"> The numerator of the multinomial is the factorial of the total number of morph tokens, N, which equals the sum of frequencies of every morph type. The denominator is the product of the factorial of the frequency of each morph type.</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.7 Search for the optimal model
</SectionTitle>
      <Paragraph position="0"> The search for the optimal model given our input data corresponds closely to the recursive segmentation algorithm presented in (Creutz and Lagus, 2002). The search takes place in batch mode, but could as well be done incrementally. All words in the data are randomly shuffled, and for each word, every split into two parts is tested. The most probable split location (or no split) is selected and in case of a split, the two parts are recursively split in two.</Paragraph>
      <Paragraph position="1"> All words are iteratively reprocessed until the probability of the model converges.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> From the point of view of linguistic theory, it is possible to come up with different plausible suggestions for the correct location of morpheme boundaries. Some of the solutions may be more elegant than others,8 but it is difficult to say if the most elegant scheme will work best in practice, when real NLP applications are concerned.</Paragraph>
    <Paragraph position="1"> We utilize an evaluation method for segmentation of words presented in (Creutz and Lagus, 2002). In this method, segments are not compared to one single &amp;quot;correct&amp;quot; segmentation. The evaluation criterion can rather be interpreted from the point of view of language &amp;quot;understanding&amp;quot;. A morph discovered by the segmentation algorithm is considered to be &amp;quot;understood&amp;quot;, if there is a low-ambiguity mapping from the morph to a corresponding morpheme. Alternatively, a morph may correspond to a sequence of morphemes, if these morphemes are very likely to occur together. The idea is that if an entirely new word form is encountered, the system will &amp;quot;understand&amp;quot; it by decomposing it into morphs that it &amp;quot;understands&amp;quot;. A segmentation algorithm that segments 8Cf. &amp;quot;hop + ed&amp;quot; vs. &amp;quot;hope + d&amp;quot; (past tense of &amp;quot;to hope&amp;quot;). words into too small parts will perform poorly due to high ambiguity. At the other extreme, an algorithm that is reluctant at splitting words will have bad generalization ability to new word forms.</Paragraph>
    <Paragraph position="2"> Reference morpheme sequences for the words are obtained using existing software for automatic morphological analysis based on the two-level morphology of Koskenniemi (1983). For each word form, the analyzer outputs the base form of the word together with grammatical tags. By filtering the output, we get a sequence of morpheme labels that appear in the correct order and represent correct morphemes rather closely. Note, however, that the morpheme labels are not necessarily orthographically similar to the morphemes they represent.</Paragraph>
    <Paragraph position="3"> The exact procedure for evaluating the segmentation of a set of words consists of the following steps:  (1) Segment the words in the corpus using the automatic segmentation algorithm.</Paragraph>
    <Paragraph position="4"> (2) Divide the segmented data into two parts of equal size. Collect all segmented word forms from the first part into a training vocabulary and collect all segmented word forms from the second part into a test vocabulary.</Paragraph>
    <Paragraph position="5"> (3) Align the segmentation of the words in the  training vocabulary with the corresponding reference morpheme label sequences. Each morph must be aligned with one or more consecutive morpheme labels and each morpheme label must be aligned with at least one morph; e.g., for a hypothetical segmentation of the English word winners': Morpheme labels win -ER PL GEN Morph sequence w inn er s' (4) Estimate conditional probabilities for the morph/morpheme mappings computed over the whole training vocabulary: p(morpheme |morph).</Paragraph>
    <Paragraph position="6"> Re-align using the Viterbi algorithm and employ the Expectation-Maximization algorithm iteratively until convergence of the probabilities.</Paragraph>
    <Paragraph position="7"> (5) The quality of the segmentation is evaluated on the test vocabulary. The segmented words in the test vocabulary are aligned against their reference morpheme label sequences according to the conditional probabilities learned from the training vocabulary. To measure the quality of the segmentation we compute the expectation of the proportion of correct mappings from morphs to morpheme labels,</Paragraph>
    <Paragraph position="9"> where N is the number of morph/morpheme mappings, and pi(*) is the probability associated with the ith mapping. Thus, we measure the proportion of morphemes in the test vocabulary that we can expect to recognize correctly by examining the morph</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We have conducted experiments involving (i) three different segmentation algorithms, (ii) two corpora in different languages (Finnish and English), and (iii) data sizes ranging from 2000 words to 200 000 words.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Segmentation algorithms
</SectionTitle>
      <Paragraph position="0"> The new probabilistic method is compared to two existing segmentation methods: the Recursive MDL method presented in (Creutz and Lagus, 2002)10 and John Goldsmith's algorithm called Linguistica (Goldsmith, 2001).11 Both methods use MDL (Minimum Description Length) (Rissanen, 1989) as a criterion for model optimization.</Paragraph>
      <Paragraph position="1"> The effect of using prior information on the distribution of morph length and frequency can be assessed by comparing the probabilistic method to Recursive MDL, since both methods utilize the same search algorithm, but Recursive MDL does not make use of explicit prior information.</Paragraph>
      <Paragraph position="2"> Furthermore, the possible benefit of using the two sources of prior information can be compared against the possible benefit of grouping stems and suffixes into signatures. The latter technique is employed by Linguistica.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Data
</SectionTitle>
      <Paragraph position="0"> The Finnish data consists of subsets of a newspaper text corpus from CSC,12 from which non-words (numbers and punctuation marks) have been  removed. The reference morpheme labels have been filtered out from a morphosyntactic analysis of the text produced by the Connexor FDG parser.13 The English corpus consists of mainly newspaper text (with non-words removed) from the Brown corpus.14 A morphological analysis of the words has been performed using the Lingsoft ENGTWOL analyzer.15 null For both languages data sizes of 2000, 5000, 10 000, 50 000, 100 000, and 200 000 have been used. A notable difference between the morphological structure of the languages lies in the fact that whereas there are about 17 000 English word types in the largest data set, the corresponding number of Finnish word types is 58 000.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Parameters
</SectionTitle>
      <Paragraph position="0"> In order to select good prior values for the probabilistic method, we have used separate development test sets that are independent of the final data sets. Morph length and morph frequency distributions have been computed for the reference morpheme representations of the development test sets.</Paragraph>
      <Paragraph position="1"> The prior values for most common morph length and proportion of hapax legomena have been adjusted in order to produce distributions that fit the reference as well as possible.</Paragraph>
      <Paragraph position="2"> We thus assume that we can make a good guess of the final morph length and frequency distributions.</Paragraph>
      <Paragraph position="3"> Note, however, that our reference is an approximation of a morpheme representation. As the segmentation algorithms produce morphs, not morphemes, we can expect to obtain a larger number of morphs due to allomorphy. Note also that we do not optimize for segmentation performance on the development test set; we only choose the best fit for the morph length and frequency distributions.</Paragraph>
      <Paragraph position="4"> As for the two other segmentation algorithms, Recursive MDL has no parameters to adjust. In Linguistica we have used Method A Suffixes + Find prefixes from stems with other parameters left at their default values. We are unaware whether another configuration could be more advantageous for Linguistica. null  nized morphemes for Finnish data.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Results
</SectionTitle>
      <Paragraph position="0"> The expected proportion of morphemes recognized by the three segmentation methods are plotted in Figures 1 and 2 for different sizes of the Finnish and English corpora. The search algorithm used in the probabilistic method and Recursive MDL involve randomness and therefore every value shown for these two methods is the average obtained over ten runs with different random seeds. However, the fluctuations due to random behaviour are very small and paired t-tests show significant differences at the significance level of 0.01 for all pair-wise comparisons of the methods at all corpus sizes.</Paragraph>
      <Paragraph position="1"> For Finnish, all methods show a curve that mainly increases as a function of the corpus size. The probabilistic method is the best with morpheme recognition percentages between 23.5% and 44.2%. Linguistica performs worst with percentages between 16.5% and 29.1%. None of the methods are close to ideal performance, which, however, is lower than 100%. This is due to the fact that the test vocabulary contains a number of morphemes that are not present in the training vocabulary, and thus are impossible to recognize. The proportion of unrecognizable morphemes is highest for the smallest corpus size (32.5%) and decreases to 8.8% for the largest corpus size.</Paragraph>
      <Paragraph position="2"> The evaluation measure used unfortunately scores  nized morphemes for English data.</Paragraph>
      <Paragraph position="3"> a baseline of no segmentation fairly high. The no-segmentation baseline corresponds to a system that recognizes the training vocabulary fully, but has no ability to generalize to any other word form.</Paragraph>
      <Paragraph position="4"> The results for English are different. Linguistica is the best method for corpus sizes below 50 000 words, but its performance degrades from the maximum of 39.6% at 10 000 words to 29.8% for the largest data set. The probabilistic method is constantly better than Recursive MDL and both methods outperform Linguistica beyond 50 000 words. The recognition percentages of the probabilistic method vary between 28.2% and 43.6%. However, for corpus sizes above 10 000 words none of the three methods outperform the no-segmentation baseline.</Paragraph>
      <Paragraph position="5"> Overall, the results for English are closer to ideal performance than was the case for Finnish. This is partly due to the fact that the proportion of unseen morphemes that are impossible to recognize is higher for English (44.5% at 2000 words, 19.0% at 200 000 words).</Paragraph>
      <Paragraph position="6"> As far as the time consumption of the algorithms is concerned, the largest Finnish corpus took 20 minutes to process for the probabilistic method and Recursive MDL, and 40 minutes for Linguistica. The largest English corpus was processed in less than three minutes by all the algorithms. The tests were run on a 900 MHz AMD Duron processor with</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> For small data sizes, Recursive MDL has a tendency to split words into too small segments, whereas Linguistica is much more reluctant at splitting words, due to its use of signatures. The extent to which the probabilistic method splits words lies somewhere in between the two other methods.</Paragraph>
    <Paragraph position="1"> Our evaluation measure favours low ambiguity as long as the ability to generalize to new word forms does not suffer. This works against all segmentation methods for English at larger data sizes. The English language has rather simple morphology, which means that the number of different possible word forms is limited. The larger the training vocabulary, the broader coverage of the test vocabulary, and therefore the no-segmentation approach works surprisingly well. Segmentation always increases ambiguity, which especially Linguistica suffers from as it discovers more and more signatures and short suffixes as the amount of data increases. For instance, a final 's' stripped off its stem can be either a noun or a verb ending, and a final 'e' is very ambiguous, as it belongs to orthography rather than morphology and does not correspond to any morpheme.</Paragraph>
    <Paragraph position="2"> Finnish morphology is more complex and there are endless possibilities to construct new word forms. As can be seen from Figure 1, the probabilistic method and Recursive MDL perform better than the no-segmentation baseline for all data sizes.</Paragraph>
    <Paragraph position="3"> The segmentations could be evaluated using other measures, but for language modelling purposes, we believe that the evaluation measure should not favour shattering of very common strings, even though they correspond to more than one morpheme.</Paragraph>
    <Paragraph position="4"> These strings should rather work as individual vocabulary items in the model. It has been shown that increased performance of n-gram models can be obtained by adding larger units consisting of common word sequences to the vocabulary; see e.g., (Deligne and Bimbot, 1995). Nevertheless, in the near future we wish to explore possibilities of using complementary and more standard evaluation measures, such as precision, recall, and F-measure of the discovered morph boundaries.</Paragraph>
    <Paragraph position="5"> Concerning the length and frequency prior distributions in the probabilistic model, one notes that they are very general and do not make far-reaching assumptions about the behaviour of natural language. In fact, Zipf's law has been shown to apply to randomly generated artificial texts (Li, 1992). In our implementation, due to the independence assumptions made in the model and due to the search algorithm used, the choice of a prior value for the most common morph length is more important than the hapax legomena value. If a very bad prior value for the most common morph length is used performance drops by twelve percentage units, whereas extreme hapax legomena values only reduces performance by two percentage units. But note that the two values are dependent: A greater average morph length means a greater number of hapax legomena and vice versa.</Paragraph>
    <Paragraph position="6"> There is always room for improvement. Our current model does not represent contextual dependencies, such as phonological rules or morphotactic limitations on morph order. Nor does it identify which morphs are allomorphs of the same morpheme, e.g., &amp;quot;city&amp;quot; and &amp;quot;citi + es&amp;quot;. In the future, we expect to address these problems by using statistical language modelling techniques. We will also study how the algorithms scale to considerably larger corpora.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML