<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1018">
  <Title>Unsupervised Discovery of Phonological Categories through Supervised Learning of Morphological Rules</Title>
  <Section position="3" start_page="95" end_page="95" type="metho">
    <SectionTitle>
2 Supervised Rule Induction with C4.5
</SectionTitle>
    <Paragraph position="0"> For the experiments, we used C4.5 (Quinlan, 1993). Although several decision tree and rule induction variants have been proposed, we chose this program because it is widely available and reasonably well tested. C4,5 is a TDIDT (Top Down Induction of Decision Trees) decision tree learning algorithm which constructs a decision tree on the basis of a set of examples (tit('. training set). This decision tree has tests (feature names) as nodes, and feature values as branches between nodes. The leaf nodes are labeled with a category name and constitute the output of the system. A decision tree constructed on the basis of examples is used after training to assign a class to patterns.</Paragraph>
    <Paragraph position="1"> To test whether the tree has actually learned the problem, and has not just memorized the items it was trained on, the 9eneralization accuracy is measured by testing the learned tree on a part of the dataset not used in training.</Paragraph>
    <Paragraph position="2"> The algorithm for the construction of a C4.5 decision tree can be easily stated. Given are a training set T (a collection of examples), and a finite number of classes C1 ... C~.</Paragraph>
    <Paragraph position="3">  1. If T contains one or more cases all belonging to the same class Cj, then the decision tree for 5/&amp;quot; is a leaf node with category Cj.</Paragraph>
    <Paragraph position="4"> 2. If T is empty, a category has to be found on the basis of other information (e.g. domain knowledge). The heuristic used here is that the most frequent class in the initial training set is used.</Paragraph>
    <Paragraph position="5"> 3. If T contains different classes then (at Choose a test (feature) with a finite num null ber of outcomes (values), and partition T into subsets of examples that have the same outcome for tim test chosen. The decision tree. consists of a root node containing the test, and a branch for each outcome, each bt'anch leading to a sub-set of the original set.</Paragraph>
    <Paragraph position="6"> (b) Apply the procedure recursively to sub-sets created this way.</Paragraph>
    <Paragraph position="7"> In this algorithm, it is not specitied which test to choose to split a node into sut)trees at some point. Taking one at random will usually result in large decision trees with poor generalization performanee, as uninformative tests may be chosen. Considering all possible trees consistent; with the data is computationally intractable, so a reliable heuristic test selection method has to be found. The method used in C4.5 is based on the concept of mutual information (or information gain). Whenever a test has to be selected, the feature is chosen with the highest information gain. This is the feature that reduces the information entropy of the training (sub)set on average most, when its value would be known. For the computation of information gain, see Quinlan (1993).</Paragraph>
    <Paragraph position="8"> Decision trees can be easily and automatically transformed into sets of if-then rules (production rules), which are in general easier to understand by domain experts (linguists in our case). In C4.5 this tree-to-ruh; transformation involves additional statistical evaluation resulting sometimes in a rule set more understandable att(.l accurate than the corresponding decision tree.</Paragraph>
    <Paragraph position="9"> The C4.5 algorithm also contains a value grouping method which, on the basis of statistical information, collapses different values for a feature into the same category. That way, more concise decision trees and rules can be produced (instead of sew'~ral different branches or rule conditions for each wflue, only one branch or condition has to be detined, making reference to a (;lass of values). The algorithm works as a heuristic search of the search space of all possible partitionings of the wdues of a particular tbature into sets, with the for-Ination of homogeneous nodes (nodes representing examples with predominantly the same category) as a heuristic guide. See Quinlan (1993) for more information.</Paragraph>
  </Section>
  <Section position="4" start_page="95" end_page="96" type="metho">
    <SectionTitle>
3 Diminutive Formation in Dutch
</SectionTitle>
    <Paragraph position="0"> In the remainder of this t)ape.r, we will describe a case study of using C4.5 to test linguistic hy1)otheses attd to discover regularities and categories. Tit(,. case study concerns allomorphy in Dutch diminutive formation, &amp;quot;one of the more vexed probleins of l)utch i,honology (...) \[and\] one of the most spectacular phenomena of modern Dutch morphophonemics&amp;quot; (Trommelen 1983).</Paragraph>
    <Paragraph position="1"> Diminutive forlnation is a productive morphological rule in Dutch. Diminutives are formed by attaching a form of the Germanic sntfix -tje to t;he singular base form of a noun. The suffix shows allomorphic variation (Table 1).</Paragraph>
    <Paragraph position="2">  tives.</Paragraph>
    <Paragraph position="3"> The fi'equency distribution of the different categories is given in Table 2. We distinguish between database frequency (frequency of a suffix in a list, of 3900 diminutive forms of nouns we took from the CELEX lexical database 1) and corpus ~Developed by tile Center for Lexical Ilfformation, Nijmegen. l)istributed by tile Linguistic Data Consortium.</Paragraph>
    <Paragraph position="4">  frequency (frequency of ~ sutfix in the text corpus on which the word list was based).</Paragraph>
    <Paragraph position="6"> morphs.</Paragraph>
    <Paragraph position="7"> llistoricnlly, dilh;rcnt a.nalyses of diminutive forreal;ion }tav(~ taken a (lifferenl, view of tile rules thai; goveru the (:hoi(:(', of 1;he diminutiv(~ sullix, and ot! the, linguistic con(:el)l;s playing a role in these rules (see, e.g. T(; Winkel 11866, Kruizinga 1(.t15, Cohen 11958, and l'ef('~t'ellces ill Tl'OItlillt',lell 1983). In t;ho, lal;1;er, il; ix argued l;hal; (limimll;ive formation ix a local 1)recess, in which collCel)l;s such as word stress and morphological st, rll(;l,llre (proposed ill l;he earlier analyses) (1() not play a r()le. The rhyme of the last syllabic of tim noun is necessary and sutlicienl; t(/ predict I;}m col'l'(~cl; a/lomort)h. The, nal;uraJ (:ategorics (or feal,ures) wlfi(:h are hyllothesised in her rules in(:lu(h', obst, r'uents, .sonorwnl,.% alld the (:lass of bimoraic vowels (consisting of long vowels, diphtongs and schwa).</Paragraph>
    <Paragraph position="8"> Diminutive formation is a. Slna\[l linguisl;i(: (lomain for which different COmlmting l,hcories have  }men pr(/t)os('xl ~ &amp;ll(\[ fol' whi(:ll (liff(',r(~nt generalizal;ions (in terms of rules and linguistic categories) have been proposed. What we will show tw, x~; is how machine learning techniques tllay t)(~ llSed I;O (i) test competing hyi)otheso~s, (ii) discovc, r gene, ralizations in the data whi(;h c}tIl I;ll(}II t)e comt/are(1 to the generMizal;ions formulated })y linguists, aim (iii) discover phonologi(:al categories in ml unsupervised way by supervised learning of diminutive suttix prediction.</Paragraph>
    <Paragraph position="9"> 4: Experiments li'or ea(:h of l,he 3900 nouns we coll(!cted, th(! following information was kept.</Paragraph>
    <Paragraph position="10"> 1. The phoneme transcription describing the syllable structure (in terms of onset, nucleus, and coda) of l;he last three syllables of the word. Missing slots are imlicatexl with =.</Paragraph>
    <Paragraph position="11"> 2. D)r each of l;hese l;hree last syllables the, preselICe ()I' abse.llce of Sl;l'O, ss.</Paragraph>
    <Paragraph position="12"> 3. The (:orreslionding dimitmtive allomorph, abbreviated to E (-etjc), T (-tie,), ./ l-j(;), K (-Me), and I' (-pie). This is the' '(:al,egory' of the word to be learned by the learner.</Paragraph>
    <Paragraph position="13"> Some examples are given below (l;he word itself and its gloss are provided for convenience and were not used in the exllerimenl,s ).</Paragraph>
    <Paragraph position="14"> - b i = - z @ = + m h nt J biezenmand (basket) ........ + b I x E big (pig) .... + b K = - b a n T bijbaan (side job) .... + b K = ~ b @ i T bijbel (bible)</Paragraph>
    <Section position="1" start_page="96" end_page="96" type="sub_section">
      <SectionTitle>
4.1 Experimental Method
</SectionTitle>
      <Paragraph position="0"> The, ext)(wim(ml;al set-u t) use(t in all eXl)Crin/(ml:s consisted of a ten-lbhl cross-wflid;ttion eXl)erimcnt (Weiss &amp; Kulikowski 1991). In this set-up, the database is partitioned l;en time~s, each with ;t diL \['orelll. 101~/ (If lll(~ dal;asel; as the tesl prot, mid the remaining 9/1% as training parL. For each C/)f l}te l,(',n simulations in our exp('~riinelll;s, I;h(~ l;esl; p;u't, was used to to, st go, ueralization perfornuu,:e. The success rate of an algoril;hm is obtained I)y cah:ulat;ing Ihc av(ua,r( ,, aCClll'/lCy (llllltll)(!l' O\[: l;(~SI, t)nt, -I,ern categories correctly predit:ted) over the l:en test sets in the ten-fold cross-validation eXlmrililO.n{;. null</Paragraph>
    </Section>
    <Section position="2" start_page="96" end_page="96" type="sub_section">
      <SectionTitle>
4.2 Learnability
</SectionTitle>
      <Paragraph position="0"> The, exp(~rim(mts show thai; the diminutive li~marion 1)roblem is learnMfle in a data-(/ri(ml;(~(l way (i.e. 1)y extraction of regularities \['rein (!xamlflCs , without, any a priori knowledge ahout~ the domain&amp;quot;). The average accuracy on unseen tx~st data of 98.4% should be compared to bast;line l)crforlnan(:e measures baso, d on tnolmbilit~y based guessing. This baseline would t)e an accura(:y of a.l)out 4()~ for this prol)h;m. This shows t;hat the tn'()l)h'm is a.lm(/st t)(;rlh(:tly h',aruabl(! I)y induction, It, shouhl 1)e noted that CI';I,I,;X contains a numl)(~r ()\[ coding (',trots, so that some (ff lhe ~wrong' all(mlOrl)hs \])r(',(li(:ted by the ma(:lfine h;arning system were actually (:(II'I'(~(;L, Wq did not correct for this.</Paragraph>
      <Paragraph position="1"> \]It the next; three secl;ions, we will describe Lhe resull;s of l;he (~xImrim(;nts; tirst on the 1;ask of (:Olllparing conlli(:ting l;he(/reti(:al hypotheses, then on discoverittg linguistic gen(;ralizaLions, and flintily (m unsul)(~rvis(~(l dis(:overy of l/h(/nologica.l cat(&gt; gories.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="96" end_page="97" type="metho">
    <SectionTitle>
5 Linguistic Hypothesis Testing
</SectionTitle>
    <Paragraph position="0"> ()n the basis of the analysis of I)utch diminutive formation by TronuneJen (1983), discussed brietly in SecLion 3, Lhe following hypotheses (among oth- null ers) can be \[brnmlated.</Paragraph>
    <Paragraph position="1"> 1. Only informatioil about the last, syllable is re, levant in predicting the, correct allomorph. 2. \[nlormation about l;he onset of the last syllabi(, is irrelevant in predicting the, correct allomorph. null 3. Stress is irrelevant in predicting l;he correct allomorph.</Paragraph>
    <Paragraph position="2"> :~lCxcepl; syllMde stru(:tm-e,  In other words, information about the rhyine of the last syllable of a noun is necessary and sufficient to predict the correct allomorph of the diminutive suffix. To test these hypotheses, we performed four experiments, training and testing the C4.5 machine learning algorithm with fore' different corpora. These corpora contained the following information.</Paragraph>
    <Paragraph position="3">  1. All information (stress, onset, nucleus, coda) about the three last syllables (3-SYLL corpus). null 2. All information about the last syllable (SONC corpus).</Paragraph>
    <Paragraph position="4"> 3. Information about the last syllable without stress (ONC corpus).</Paragraph>
    <Paragraph position="5"> 4. Information about the last syllable without stress and onset (NC corpus).</Paragraph>
  </Section>
  <Section position="6" start_page="97" end_page="98" type="metho">
    <SectionTitle>
5.1 Results
</SectionTitle>
    <Paragraph position="0"> Table 3 lists the learnability results. The generalization error is given for each allomorph for the four different; training corpora.</Paragraph>
    <Paragraph position="1">  The overall best results are achieved with the most elaborate corpus (containing all information about; the three last syllables), suggesting that, eontra Trommelen, important information is lost by restricting attention to only the last syllable. As far as the different encodings of the last syllable are concerned, however, the learnability experiment coroborates Trommelen's claim that stress and onset are not necessary to predict the correct diminutive allomorph. When we look at the error rates for individual allomorphs, a more complex picture emerges. The error rate on -etje dramatically increases (from 7% to 14%) when restricting information to the last syllable. The -k~e allomorph, on the other hand, is learned perfectly on the basis of the last syllable alone. What has happened here is that the learning method has overgeneralized a rule predicting -kje after the velar nasal, because the data do not contain enough information to correctly handle the notoriously difficult opposition between words like leerling (pupil, takes -etje) and koning (king, takes -kje). Purthermore, the error rate on -pje is doubled when onset information is left out from the corpus.</Paragraph>
    <Paragraph position="2"> We can conch;de from these experiments that although the broad lines of the analysis by Trommelen (1983) are correct, the learnability results point at a number of problems with it (notably with -kje versus -etje and with -pje). We will move now to the use of inductive learning algorithms as a generator of generalizations about the domain, and compare these generalizations to the analysis of Trommelen.</Paragraph>
    <Section position="1" start_page="97" end_page="98" type="sub_section">
      <SectionTitle>
6 Supervised Learning of
Linguistic Generalizations
</SectionTitle>
      <Paragraph position="0"> When looking only at the rhyme of the last syllable (the NC corpus), the decision tree generated by C4.5 looks as follows: Decision Tree: coda in {rk,nt,lt,rt,p,k,t,st,s,ts,rs,rp,f, x, ik,Nk,mp, xt,rst,ns ,nst, rx,kt, ft, if ,mr, Ip,ks, is,kst, ix} : J coda in {n,=,l,j,r,m,N,rn,rm,w,lm}: nucleus in {I,A,},O,E}: coda in {n,l,r,m}: E coda in {=,j,rn}: T coda in {rm,lm}: P</Paragraph>
      <Paragraph position="2"> Notice that the phoneme representation used by CELEX (called DISC) is shown here instead of the more standard IPA font, and that the value grouping mechanism of C4.5 has created a mnnber of phonological categories by collapsing different phonemes into sets indicated by curly brackets.</Paragraph>
      <Paragraph position="3"> This decision tree should be read as follows: first check the coda (of the last syllable). If it ends in an obstruent, the allomorph is -jc. If not, check tile nucleus. If it is bimoraic, and the coda is/m/, decide -pje, if the coda is not/m/, decide -tje. When the coda is not an obstruent, the nucleus is short and the coda is /ng/, we have to look at the nucleus again to decide between -kje and -etje (this is where the overgeneralization to -kje for words in -ing occurs). Finally, the coda (nasa-liquid or not) helps us distinguish between -etje and -pje for those cases where the nucleus is short. It should be clear that this tree can easily be formulated as a set of rules without loss of accuracy.</Paragraph>
      <Paragraph position="4"> An interesting problem is that the -etje versus -kje problem for words ending in -ing couht hot be solved by referring only to the last syllable (C4.5 and any other statistically based induction algorithm overgeneralize to -kjc). The following is the knowledge derived by C4.5 t'rofll the flfll corpus, with all information about the three last syllables (the 3 SYLL corpus). We provide the rule version of the inferred knowledge this time.</Paragraph>
      <Paragraph position="5">  The default class is -tjc, which is the allomorph chosen when none of the other rules apply. This explains why this rule set looks simi)h'.r than tit(; decision tree earlier.</Paragraph>
      <Paragraph position="6"> The first thing which is interesting in this rule set, is that only tlu'ee of the twelve presented features (coda an(1 nllclelts of (;lie last syllable, nllcleus of the i)emlltimate syllal)le) m'e used in the rules. Contrary to the hyi)oth(;sis of Trommelen, apart from the rhyme of the last sylbfl)le, the m&gt; (:\[eus of the pemfltimate sylhd)le is taken to \])e re.levant ;~s well.</Paragraph>
      <Paragraph position="7"> The induced rules roughly correspond to the previous decision tree, but; in ad(lition a solution is provided to the -etje versus -kje problem for words ending in -in9 (rule 3) making use of information about the nucleus of the. t)emfltiInate syllabi(;. Rule 3 states that words ending in /ng/ get -etjeas (liminutive alloinorl)h when they are monosyllables (nucleus of the penultimate syllable is empty) or when they have a schwa as t)(multi mate rainless, and -kjc othe, rwise. As fro as we now, this generalization has not been prot)osed in this form in the lmblished literature on diminutive formation.</Paragraph>
      <Paragraph position="8"> We conclude from this part of the experiment that the Inaehine learning inethod has suc(:ee(led in extracting a sophistieate(l set of linguistic rules from the examph'.s in a purely data-oriented way, an(l that these rules are formulated at a level that makes their use in the development of linguistic theories possible.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="98" end_page="99" type="metho">
    <SectionTitle>
7 Discovery of Phonological
Categories
</SectionTitle>
    <Paragraph position="0"> To structure the phoneme inventory of a language, linguists define features. '\['hese ekLIl be interpreted as sets of st)ee(:h sounds (categories): e.g. the category (or feature) labial groups those speech sounds that involve the lips as an a(:tive art\[culator. Speech sounds behmg to different categories, i.e., are defined by ditferent \['e~tures. l,',.g. t is voiceless, a coronal, and a stop. Categories t)roposed in phonology are inspired by articulatory, acoustic or tmreeptual phonetic ditferences between speech sounds. They are also proposed to allow an optimally concise or elegant formulation of rules for the description of phonological or mot'phological processes. E.g., the so-calleA major (:lass features (obstruents, nasals, liquids, glides, vowels) efficiently explain syllable structure eomput;ation, lint are of little use in the definition of rules describing assimilation. For ass\[re\[In\[ion, placu of mti(:ulation f(~atllr(~s arc t)est ilse(l. This situation has led to the t)roposa.l of many dillhrenC phonoh)gieal category systems.</Paragraph>
    <Paragraph position="1"> Whih; constructing the decision tre.e (see prey\[-Oils section), several t)honologically relevant cat(;gories are 'discovered' by the value grouping mechanism in C4.5, including the nasals, the liquids, the obstruents, the short vowels, mtd the bimoraic vowels. This last category corresponds completely with the (then new) category hypothesise, d by Trommelen and containing the long vowels, tit(; diphtongs att(l the s('hwa, in oth(;r words, the learning a.lgorithm has discovered this set of phonemes to 1)e a useful category in solving the (|iminut;ive formation problem t)y t)rovi(ling ml ex-I,e.nsional detinition of it (a lisl; of tim inst;ulees ()\[: I;he ea.tegory).</Paragraph>
    <Paragraph position="2"> This raises the question of the task-dependence of linguistic categories. Similar experiments in Dutch t)lural formation, for examt)le, fail to produce th(' (:atcgory of bimoraic vowels, and for some tasks, categori(:s show u t) which hi~vc no ontological status in linguistics. In other words, making category formation del)endent oil the task to t)e learned, unde.rmitms the. tratlitional linguistic ideas about absolute, task-indel)endent (and even 1;mguage-in(h',t)endeitt) categories. We present heI'e &amp; lI(!~,v methodology with which this flltl(lDomental issue in linguistics can t)(; investigated: category systems ext;racted for difl'erent tasks in different languages can be studied to see which categories (if any) truely have a universal status.</Paragraph>
    <Paragraph position="3"> This is subject tbr fllrther resem'ch. It wouhl also l)e use.rid to stu(ly the indu(:ed categories when intensional descriptions (feature represeutations) are used as input instead of extensional descrit)lions (phoitetnes).</Paragraph>
    <Paragraph position="4"> We also experimented with a siml)h;r alternative to the computationally complex heuristic category \[orma.tion algorithm used by (;4.5. This method is inspire(1 by machine learning work on wflue dif ference metrics (Stanfill &amp; Waltz, 1986; Cost &amp; Salzberg, :1993). Starting fl'om the training set of the sut)ervised learning exl)erinlent (the set ()f input ouq)ut mappings used by the system to extract rules), we selc(:t a particular feature (e.g. the coda of the last syllable), and comt)ute a table as- null sociating with each t)ossit)le value of tile feature the number of times the pattern in which it, occurs was assigned to each different category (in this case, each of the the five allomorphs). This produces a table with for each value a distribution over categories. This table is then used in standard clustering approaches to derive categories of values (in this case consonmlts). The following is one of these clustering results. The example shows that this computationally simple approach also succeeds in discovering categories in an unsupervised way on tile basis of data for supervised learning.</Paragraph>
    <Paragraph position="5">  ....... &gt; l I ....... &gt; r -I ............ &gt; n I---I ..... I ..... &gt; t I I I ..... I .... &gt; k I ............ I I .... &gt; s I ..... I--&gt; p I .... I I--&gt; f I ..... I .... &gt; m I .... I .... &gt; N I .... I--&gt; x</Paragraph>
    <Paragraph position="7"> Several categories, relevant for diminutive formation, such as liquids, nasals, the velar nasal, semi-vowels, fi'icatives etc., are reflected in this hierarchical clustering.</Paragraph>
  </Section>
class="xml-element"></Paper>