<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1036">
  <Title>Unsupervised Segmentation of Words Using Prior Distributions of Morph Length and Frequency</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In order to artificially &amp;quot;understand&amp;quot; or produce natural language, a system presumably has to know the elementary building blocks, i.e., the lexicon, of the language. Additionally, the system needs to model the relations between these lexical units. Many existing NLP (natural language processing) applications make use of words as such units. For instance, in statistical language modelling, probabilities of word sequences are typically estimated, and bag-of-word models are common in information retrieval. null However, for some languages it is infeasible to construct lexicons for NLP applications, if the lexicons contain entire words. In especially agglutinative languages,1 such as Finnish and Turkish, the 1In agglutinative languages words are formed by the concatenation of morphemes.</Paragraph>
    <Paragraph position="1"> number of possible different word forms is simply too high. For example, in Finnish, a single verb may appear in thousands of different forms (Karlsson, 1987).</Paragraph>
    <Paragraph position="2"> According to linguistic theory, words are built from smaller units, morphemes. Morphemes are the smallest meaning-bearing elements of language and could be used as lexical units instead of entire words. However, the construction of a comprehensive morphological lexicon or analyzer based on linguistic theory requires a considerable amount of work by experts. This is both time-consuming and expensive and hardly applicable to all languages. Furthermore, as language evolves the lexicon must be updated continuously in order to remain up-to-date.</Paragraph>
    <Paragraph position="3"> Alternatively, an interesting field of research lies open: Minimally supervised algorithms can be designed that automatically discover morphemes or morpheme-like units from data. There exist a number of such algorithms, some of which are entirely unsupervised and others that use some knowledge of the language. In the following, we discuss recent unsupervised algorithms and refer the reader to (Goldsmith, 2001) for a comprehensive survey of previous research in the whole field.</Paragraph>
    <Paragraph position="4"> Many algorithms proceed by segmenting (i.e., splitting) words into smaller components. Often the limiting assumption is made that words consist of only one stem followed by one (possibly empty) suffix (D'ejean, 1998; Snover and Brent, 2001; Snover et al., 2002). This limitation is reduced in (Goldsmith, 2001) by allowing a recursive structure, where stems can have inner structure, so that they in turn consist of a substem and a suffix. Also prefixes are possible. However, for languages with agglutinative morphology this may not be enough.</Paragraph>
    <Paragraph position="5"> In Finnish, a word can consist of lengthy sequences of alternating stems and affixes.</Paragraph>
    <Paragraph position="6"> Some morphology discovery algorithms learn relationships between words by comparing the orthographic or semantic similarity of the words (Schone and Jurafsky, 2000; Neuvel and Fulop, 2002; Baroni et al., 2002). Here a small number of components per word are assumed, which makes the approaches difficult to apply as such to agglutinative languages.</Paragraph>
    <Paragraph position="7"> We previously presented two segmentation algorithms suitable for agglutinative languages (Creutz and Lagus, 2002). The algorithms learn a set of segments, which we call morphs, from a corpus.</Paragraph>
    <Paragraph position="8"> Stems and affixes are not distinguished as separate categories by the algorithms, and in that sense they resemble algorithms for text segmentation and word discovery, such as (Deligne and Bimbot, 1997; Brent, 1999; Kit and Wilks, 1999; Yu, 2000). However, we observed that for the corpus size studied (100 000 words), our two algorithms were somewhat prone to excessive segmentation of words.</Paragraph>
    <Paragraph position="9"> In this paper, we aim at overcoming the problem of excessive segmentation, particularly when small corpora (up to 200 000 words) are used for training.</Paragraph>
    <Paragraph position="10"> We present a new segmentation algorithm, which is language independent and works in an unsupervised fashion. Since the results obtained suggest that the algorithm performs rather well, it could possibly be suitable for languages for which only small amounts of written text are available.</Paragraph>
    <Paragraph position="11"> The model is formulated in a probabilistic Bayesian framework. It makes use of explicit prior information in the form of probability distributions for morph length and morph frequency. The model is based on the same kind of reasoning as the probabilistic model in (Brent, 1999). While Brent's model displays a prior probability that exponentially decreases with word length (with one character as the most common length), our model uses a probability distribution that more accurately models the real length distribution. Also Brent's frequency distribution differs from ours, which we derive from Mandelbrot's correction of Zipf's law (cf. Section 2.5). Our model requires that the values of two parameters be set: (i) our prior belief of the most common morph length, and (ii) our prior belief of the proportion of morph types2 that occur only once in the corpus. These morph types are called hapax legomena. While the former is a rather intuitive measure, the latter may not appear as intuitive. However, the proportion of hapax legomena may be interpreted as a measure of the richness of the text. Also note that since the most common morph length is calculated for morph types, not tokens, it is not independent of the corpus size. A larger corpus usually requires a higher average morph length, a fact that is stated for word lengths in (Baayen, 2001).</Paragraph>
    <Paragraph position="12"> As an evaluation criterion for the performance of our method and two reference methods we use a measure that reflects the ability to recognize real morphemes of the language by examining the morphs found by the algorithm.</Paragraph>
  </Section>
class="xml-element"></Paper>