<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3210">
  <Title>A Naive Theory of Affixation and an Algorithm for Extraction</Title>
  <Section position="2" start_page="0" end_page="80" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The problem at hand can be described as follows. Input: an unlabeled corpus of an arbitrary natural language. Output: a (possibly ranked) set of prefixes and suffixes corresponding to true prefixes and suffixes in the linguistic sense, i.e., well segmented and with grammatical meaning, for the language in question.</Paragraph>
    <Section position="1" start_page="0" end_page="80" type="sub_section">
      <SectionTitle>
Restrictions
</SectionTitle>
      <Paragraph position="0"> We consider only concatenative morphology and assume that the corpus comes already segmented on the word level.</Paragraph>
      <Paragraph position="1"> The theory and practice of the problem are relevant, or even essential, in fields such as child language acquisition, information retrieval and, of course, the fuller scope of computational morphology and its further layers of application (e.g., Machine Translation). The reasons for attacking this problem in an unsupervised manner include advantages in elegance, economy of time and money (no annotated resources required), and the fact that the same technology may be used on new languages.</Paragraph>
      <Paragraph position="2"> An outline of the paper is as follows: we start with some notation and basic definitions, with which we describe the theory that is intended to model the essential behaviour of affixation in natural languages. Then we describe in detail and with examples the thinking behind the affix extraction algorithm, which actually requires only a few lines to define mathematically. Next, we present and discuss some experimental results on typologically different languages. The paper then finishes with a brief but comprehensive characterization of related work and how it differs from our work. At the very end we state the most important conclusions and ideas on future components of unsupervised morphological analysis.
2 A Naive Theory of Affixation
Notation and definitions:
* $w, s, b, x, y, \ldots \in \Sigma^*$: lowercase-letter variables range over strings of some alphabet $\Sigma$ and are variously called words, segments, strings, etc.</Paragraph>
      <Paragraph position="3"> * $s \triangleleft w$: $s$ is a terminal segment of the word $w$, i.e., there exists a (possibly empty) string $x$ such that $w = xs$
* $W, S, \ldots \subseteq \Sigma^*$: capital-letter variables range over sets of words/strings/segments
* $f_W(s) = |\{w \in W \mid s \triangleleft w\}|$: the number of words in $W$ with terminal segment $s$
* $S_W = \{s \mid s \triangleleft w \in W\}$: all terminal segments of the words in $W$
* $|\cdot|$: is overloaded to denote both the length of a string and the cardinality of a set
Assume we have two sets of random strings, $B$ and $S$, over some alphabet $\Sigma$</Paragraph>
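      <Paragraph> The notation above can be made concrete with a minimal Python sketch. The function names and the toy word set are illustrative assumptions, not part of the original definitions; the sketch only enumerates terminal segments and counts how many words share each one.

```python
from collections import Counter


def terminal_segments(word):
    """Return every terminal segment of word, including the empty string and word itself."""
    return [word[i:] for i in range(len(word) + 1)]


def segment_frequencies(words):
    """Map each terminal segment s to f_W(s), the number of words in W that end in s."""
    counts = Counter()
    for word in set(words):  # W is a set, so each distinct word counts once
        counts.update(terminal_segments(word))
    return counts


# Hypothetical toy word set, not data from the paper.
W = {"walked", "talked", "walks", "talks"}
f_W = segment_frequencies(W)
S_W = set(f_W)  # all terminal segments of the words in W
```

Here f_W["ed"] counts the two words ending in "ed", and the empty string is a terminal segment of every word, so f_W[""] equals the size of W.</Paragraph>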
      <Paragraph position="5"> Such that: Arbitrary Character Assumption (ACA): Each character $c \in \Sigma$ should be equally likely in any word position for any member of $B$ or $S$. Note that $B$ and $S$ need not be of the same cardinality and that any string, including the empty string, could end up belonging to both $B$ and $S$. Nor need they be sampled from the same distribution: without violating the requirement, the distributions from which $B$ and $S$ are drawn may differ in how much probability mass is given to strings of different lengths. For instance, it would not be a violation if $B$ were drawn from a distribution favouring strings of length, say, 42 and $S$ from a distribution with a strong bias for short strings.</Paragraph>
      <Paragraph position="6"> Next, build a set of affixed words $W \subseteq \{bs \mid b \in B, s \in S\}$, that is, a large set whose members are concatenations of the form $bs$ for $b \in B$, $s \in S$, such that: Frequent Flyer Assumption (FFA): The members of $S$ are frequent. Formally: given any $s \in S$, $f_W(s) \gg f_W(x)$ for all $x$ such that:</Paragraph>
      <Paragraph position="7"> 1. $|x| = |s|$; and 2. $x \not\triangleleft s'$ for all $s' \in S$. In other words, if we call $s \in S$ a true suffix and we call $x$ an arbitrary segment if it is neither a true suffix nor the terminal segment of a true suffix, then any true suffix should have much higher frequency than an arbitrary segment of the same length.</Paragraph>
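      <Paragraph> One way to probe the FFA on a small example is sketched below in Python. This is an illustrative sketch, not the procedure used in the paper: the function names, the toy sets B and S, and the use of a frequency ratio as a stand-in for "much higher frequency" are all assumptions made here.

```python
from collections import Counter


def segment_frequencies(words):
    """Map each terminal segment s to f_W(s), counting each distinct word once."""
    counts = Counter()
    for word in set(words):
        counts.update(word[i:] for i in range(len(word) + 1))
    return counts


def ffa_ratios(words, true_suffixes):
    """For each true suffix s, divide f_W(s) by the highest f_W(x) among arbitrary segments x of the same length."""
    f_W = segment_frequencies(words)
    # Condition 2: terminal segments of true suffixes are not arbitrary segments.
    protected = {s[i:] for s in true_suffixes for i in range(len(s) + 1)}
    ratios = {}
    for s in true_suffixes:
        rivals = [n for x, n in f_W.items() if len(x) == len(s) and x not in protected]
        ratios[s] = f_W[s] / max(rivals) if rivals else float("inf")
    return ratios


# Hypothetical toy sets, not data from the paper.
B = {"walk", "talk", "jump", "open"}
S = {"ed", "ing", "s"}
W = {b + s for b in B for s in S}
ratios = ffa_ratios(W, S)
```

A ratio well above 1 for every true suffix is what the FFA would lead one to expect; how large the ratio must be to count as "much higher" is left open by the definition. In this toy set, every one-character terminal segment is itself a terminal segment of a true suffix, so the suffix "s" has no arbitrary rival of its length.</Paragraph>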
      <Paragraph position="8"> One may legitimately ask to what extent words of real natural languages fit the construction model of $W$, with the strong ACA and FFA assumptions, outlined above. For instance, even though natural languages often are not written phonemically, it is not hard to come up with languages that have phonotactic constraints on what may appear at the beginning or end of a word, e.g., Spanish *st- may not begin a word and yields est- instead. Another violation of ACA is that presumably all languages (Ladefoged, 2005) disallow or disprefer a consonant or a vowel depending on the vowel/consonant status of its predecessor. However, if a certain element occurs with less frequency than random (the best example would be click consonants which, in some languages, e.g., Eastern !Xóõ (Traill, 1994), occur only initially), this is not a problem for the theory.</Paragraph>
      <Paragraph position="9"> As for FFA, we may have breaches such as Biblical Aramaic (Rosenthal, 1995), where an old -a element appears virtually everywhere on nouns, making it very frequent, but no longer has any synchronic meaning. Also, one can doubt the requirement that an affix should need to be frequent; for instance, the Classical Greek inflectional (lacking synchronic internal segmentation) alternative medial 3p. pl. aorist imperative ending -sthon (Blomqvist and Jastrup, 1998) is not common at all.</Paragraph>
      <Paragraph position="10"> Just how realistic the assumptions are is an empirical question, whose answer must be judged by experiments on the relevant languages. In the absence of fully annotated test sets for diverse languages, and since the author does not have access to the Hutmegs/CELEX gold standard sets for Finnish and English (Creutz and Lindén, 2004), we can only give some indicative experimental data.</Paragraph>
      <Paragraph position="11"> ACA: On a New Testament corpus of Basque (Leizarraga, 1571) we computed the probability of a character appearing in the initial, second, third or fourth position of the word. [Table 1, caption truncated: ... according to word position.]</Paragraph>
      <Paragraph position="12"> Since Basque is entirely suffixing, if it complied with ACA we would expect the distributions to be similar. However, if we look at the difference between the distributions in terms of the variation distance between two probability distributions ($\|p - q\| = \sum_x |p(x) - q(x)|$), it shows that they differ considerably; the initial position in particular stands out (see Table 1).</Paragraph>
      <Paragraph position="13"> FFA: As for the FFA, we checked a corpus of Bible portions of Warlpiri (Yal, 1968 2001). This was chosen because it is one of the few languages known to the author where data was available and which has a fair number of frequent suffixes that are also long, e.g., case affixes are typically bisyllabic phonologically and about five characters long orthographically. Since the orthography used marked segmentation, it was easy to compute FFA statistics on the words with the segmentation marking removed. Comparing with the lists in Nash (1980, Ch. 2), it turns out that FFA is remarkably stable for all grammatical suffixes occurring in the outermost layer. There are, however, the expected kinds of breaches; e.g., a tense suffix -ku combined with a final vowel -u, which is common in some frequent preceding affixes, makes the terminal segment -uku more frequent than some genuine three-letter suffixes.</Paragraph>
      <Paragraph position="14"> The language known to the author which has shown the most systematic disagreement with the FFA is Haitian Creole (also in Bible corpus experiments (Hai, 2003 1999)). Haitian Creole has very little morphology of its own but owes the lion's share of its words to French.</Paragraph>
      <Paragraph position="15"> French derivational morphemes abound in these words, e.g., -syon, which have been carefully shown by Lefebvre (2004) not to be productive in Haitian Creole. Thus, the little morphology there is in Haitian Creole is very difficult to get at without also getting the French relics.</Paragraph>
    </Section>
  </Section>
</Paper>