<?xml version="1.0" standalone="yes"?> <Paper uid="W94-0208"> <Title>SEGMENTING SPEECH WITHOUT A LEXICON: THE ROLES OF PHONOTACTICS AND SPEECH SOURCE</Title> <Section position="4" start_page="85" end_page="87" type="metho"> <SectionTitle> SIMULATION DETAILS </SectionTitle> <Paragraph position="0"> To use the MDL principle, as introduced above, we search for the smallest-sized hypothesis. We must have some well-defined method of measuring hypothesis sizes for this method to work. A simple, intuitive way of measuring the size of a hypothesis is to count the number of characters used to represent it, for example by counting the characters (excluding spaces) in the introductory example. To be fully self-delimiting, the width of a field must be represented in a self-delimiting way; we use a unary representation--i.e., write an extra field consisting of only '1' bits followed by a terminating '0'. There are n fields (one for each word), plus the unary prefix, so the combined length of the fields plus prefix (plus terminating zero) is:</Paragraph> <Paragraph position="2"> The total length of the word inventory column representation is the sum of the terms in (1), (2) and (3).</Paragraph> <Paragraph position="3"> The code word inventory column of the lexicon (see Figure 1b for a schematic) has a nearly identical representation to the previous column, except that code words are listed instead of phonemic words--the length fields and unary prefix serve the same purpose of marking the divisions between code words.</Paragraph> <Paragraph position="4"> The sample can be represented most compactly by assigning short code words to frequent words, reserving longer code words for infrequent words. To satisfy this property, code words are assigned so that their lengths are frequency-based; the length of the code word for a word of frequency f(wi) will not be greater than:</Paragraph> <Paragraph position="6"> The total length of the code word list is the sum of the code word lengths over all lexical entries:</Paragraph> <Paragraph position="8"> As in the word inventory column (described above), the length of each code word is represented in a fixed-length field. Since the least frequent word will have the longest code word (a property of the formula for len([wi])), the longest possible code word comes from a word of frequency one:</Paragraph> <Paragraph position="10"> Since the fields contain integers between one and this number, we define the length of a field to be ⌈log2(log2 m)⌉. As above, we represent the width of a field in unary, so there are a total of n + 1 elements of this size (n fields plus the unary representation of the field width). The combined length of the fields plus prefix (and terminating zero) is:</Paragraph> <Paragraph position="12"> The total length of the code word inventory column representation is the sum of the terms in (4) and (5).</Paragraph> <Paragraph position="13"> Finally, the sequence of words which forms the sample (see Figure 1c for a schematic) is represented as the number of words in the sample (m) followed by the list of code words. Since code words are used as compact indices into the lexicon, the original sample could be reconstructed completely by looking up each code word in this list and replacing it with its phoneme sequence from the lexicon. The code words we assigned to lexical items are self-delimiting (once the set of codes is known), so there is no need to represent the boundaries between code words.</Paragraph>
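To make the bookkeeping above concrete, the sketch below computes the three components of a hypothesis's description length for a proposed segmentation. It is a minimal reconstruction under stated assumptions, not the authors' implementation: code word lengths are taken to be the Shannon bound ceil(log2(m / f(w))) suggested by the frequency-based assignment, field widths follow the prose above, and the function and variable names (description_length, word_inventory, and so on) are ours.

    import math
    from collections import Counter

    def description_length(segmented_utterances):
        """Rough size, in bits, of a segmentation hypothesis.

        segmented_utterances is a list of utterances, each a list of
        hypothesized words (phoneme strings).  The three components mirror
        the prose: word inventory column, code word inventory column, and
        the encoded sample.  A simplified sketch, not the paper's exact formulas.
        """
        tokens = [w for utt in segmented_utterances for w in utt]
        freqs = Counter(tokens)          # f(w) for each lexical entry
        n = len(freqs)                   # number of lexical entries
        m = len(tokens)                  # number of word tokens in the sample

        # Word inventory column: the phonemes of each entry, fixed-width
        # length fields, a unary prefix giving the field width, and a
        # terminating zero (one unit per character, as in the text).
        phoneme_units = sum(len(w) for w in freqs)
        word_field = max(1, math.ceil(math.log2(max(len(w) for w in freqs) + 1)))
        word_inventory = phoneme_units + n * word_field + (word_field + 1)

        # Code word inventory column: frequency-based code word lengths
        # (assumed Shannon bound), plus fixed-width length fields wide
        # enough to hold the longest code word length, plus unary prefix.
        def code_len(w):
            return max(1, math.ceil(math.log2(m / freqs[w])))
        longest_code = max(1, math.ceil(math.log2(max(2, m))))   # a frequency-one word
        code_field = max(1, math.ceil(math.log2(longest_code + 1)))
        code_inventory = (sum(code_len(w) for w in freqs)
                          + n * code_field + (code_field + 1))

        # Sample: the integer m followed by one code word per token,
        # i.e. the sum over the lexicon of f(w) * len([w]).
        sample = (math.ceil(math.log2(m + 1))
                  + sum(freqs[w] * code_len(w) for w in freqs))

        return word_inventory + code_inventory + sample

A hypothesis is scored by calling description_length on its utterances segmented into hypothesized words; competing segmentations of the same sample are then compared by these scores alone.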
<Paragraph position="14"> The length of the representation of the integer m is given by the function in (6). The length of the representation of the sample is computed by summing the lengths of the code words used to represent the sample. We can simplify this description by noting that the combined length of all occurrences of a particular code word [wi] is f(wi) * len([wi]), since there are f(wi) occurrences of the code word in the sample. So, the length of the encoded sample is the sum of this formula over all words in the lexicon:</Paragraph> <Paragraph position="16"> The total length of the sample is given by adding the terms in (6) and (7). The total length of the representation of the entire hypothesis is the sum of the representation lengths of the word inventory column, the code word inventory column and the sample.</Paragraph> <Paragraph position="17"> This system of computing hypothesis sizes is efficient, in the sense that elements are thought of as being represented compactly and that code words are assigned based on the relative frequencies of words. The final evaluation given to a hypothesis is an estimate of the minimal number of bits required to transmit that hypothesis. As such, it permits direct comparison between competing hypotheses; that is, the shorter the representation of a hypothesis, the more distributional information can be extracted and, therefore, the better the hypothesis.</Paragraph> <Paragraph position="18"> Phonotactics Phonotactic knowledge was given to the system as a list of licit initial and final consonant clusters of English words; this list was checked against all six samples so that the list was maximally permissive (e.g., the underlined consonant cluster in explore could be divided as ek-splore or eks-plore). In those simulations which used the phonotactic knowledge, a word boundary could not be inserted when doing so would create a word-initial or word-final consonant cluster not on the list, or would create a word without a vowel. For example (from an actual sample--corresponding to the utterance, &quot;Want me to help baby?&quot;): Sample: wantmituhelpbebi Valid Boundaries: want.mi.t.u.help.be.bi In the second line, those word boundaries that are phonotactically legal are marked with dots. The boundary between /w/ and /a/ is illegal because /w/ by itself is not a legal word in English; the boundary between /a/ and /n/ is illegal because /ntm/ is not a valid word-initial consonant cluster; the boundary between /m/ and /i/ is illegal because /ntm/ is also not a valid word-final consonant cluster; the boundary between /p/ and /b/ is legal because /lp/ is a valid word-final cluster and /b/ is a valid word-initial cluster. Note that using the phonotactic constraints reduces the number of potential word boundaries from fifteen to six in this example.</Paragraph> <Paragraph position="19"> After the system inserts a new word boundary, it updates the list of remaining valid insertion points--adding a point may cause nearby points to become unusable due to the restriction that every word must have a vowel. For example (corresponding to the utterance &quot;green and&quot;): after the segmentation of /grin/ and /ænd/, the potential boundary between /i/ and /n/ becomes invalid because inserting a word boundary there would produce a word with no vowel (/n/).</Paragraph>
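The boundary filter just described can be sketched as follows. The cluster lists and vowel set here are illustrative stand-ins rather than the lists actually used in the simulations, and the function names are ours; the check itself mirrors the two conditions in the text: the resulting word-final and word-initial consonant clusters must be licit, and neither resulting word may be left without a vowel.

    # Illustrative stand-ins for the paper's lists of licit English clusters.
    VOWELS = set("aeiou")                    # syllabic consonants also counted as vowels
    LEGAL_INITIAL = {"", "b", "h", "m", "t", "w", "spl"}   # word-initial clusters
    LEGAL_FINAL = {"", "t", "nt", "lp"}                    # word-final clusters

    def final_cluster(word):
        """Consonants at the end of `word`, back to the last vowel."""
        i = len(word)
        while i > 0 and word[i - 1] not in VOWELS:
            i -= 1
        return word[i:]

    def initial_cluster(word):
        """Consonants at the start of `word`, up to the first vowel."""
        i = 0
        while i < len(word) and word[i] not in VOWELS:
            i += 1
        return word[:i]

    def boundary_is_valid(sample, pos):
        """True if splitting `sample` at `pos` keeps both sides phonotactically licit."""
        left, right = sample[:pos], sample[pos:]
        if not any(ch in VOWELS for ch in left) or not any(ch in VOWELS for ch in right):
            return False                     # every word must contain a vowel
        return (final_cluster(left) in LEGAL_FINAL
                and initial_cluster(right) in LEGAL_INITIAL)

    # With these stand-in lists, the running example keeps six of its
    # fifteen internal positions, the ones dotted in the text.
    sample = "wantmituhelpbebi"
    valid = [p for p in range(1, len(sample)) if boundary_is_valid(sample, p)]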
<Section position="1" start_page="87" end_page="87" type="sub_section"> <SectionTitle> Inputs and Simulations </SectionTitle> <Paragraph position="0"> Two speech samples from each of three subjects were used in the simulations: in one sample a mother was speaking to her daughter, and in the other the same mother was speaking to the researcher. The samples were taken from the CHILDES database (MacWhinney &amp; Snow, 1990) from studies reported in Bernstein (1982). Each sample was checked for consistent word spellings (e.g., 'ts was changed to its), then was transcribed into an ASCII-based phonemic representation.</Paragraph> <Paragraph position="1"> The transcription system was based on IPA and used one character for each consonant or vowel; diphthongs, r-colored vowels and syllabic consonants were each represented as one character. For example, &quot;boy&quot; was written as bT, &quot;bird&quot; as bRd and &quot;label&quot; as lebL. For purposes of phonotactic constraints, syllabic consonants were treated as vowels. Sample lengths were selected to make the number of available segmentation points nearly equal (about 1,350) when no phonotactic constraints were applied; child-directed samples had 498-536 tokens and 153-166 types, adult-directed samples had 443-484 tokens and 196-205 types.</Paragraph> <Paragraph position="2"> Finally, before the samples were fed to the simulations, divisions between words (but not between sentences) were removed. The space of possible hypotheses is vast (there are about 2^1350, or roughly 2.5 x 10^406, hypotheses), so some method of finding a minimum-length hypothesis without considering all hypotheses is necessary. We used the following method: first, evaluate the input sample with no segmentation points added; then evaluate all hypotheses obtained by adding one or two segmentation points; take the shortest hypothesis found in the previous step and evaluate all hypotheses obtained by adding one or two more segmentation points; continue this way until the sample has been segmented into the smallest possible units and report the shortest hypothesis ever found. Two variants of this simulation were used: (1) DIST-FREE was free of any phonotactic restrictions on the hypotheses it could form (DIST refers to the measurement of distributional information); (2) DIST-PHONO was restricted to hypotheses permitted by the phonotactic constraints.</Paragraph> <Paragraph position="5"> Each simulation was run on each sample, for a total of twelve DIST runs.</Paragraph> <Paragraph position="6"> Finally, two other simulations were run on each sample to measure chance performance: (1) RAND-FREE inserted random segmentation points and reported the resulting hypothesis; (2) RAND-PHONO inserted random segmentation points where permitted by the phonotactic constraints. Since the RAND simulations were given the number of segmentation points to add (equal to the number of segmentation points needed to produce the natural English segmentation), their performance is an upper bound on chance performance. In contrast, the DIST simulations must determine the number of segmentation points to add using MDL evaluations. The results for each RAND simulation are averages over 1,000 trials on each input sample.</Paragraph>
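The greedy search described above might be sketched as follows, assuming a scoring function such as the description_length sketch earlier and, for the -PHONO variants, a set of phonotactically valid insertion points; the names and the single-sample simplification are ours, not the authors'.

    from itertools import combinations

    def greedy_mdl_segment(sample, score, valid_points=None):
        """Greedy MDL search over segmentations of one unsegmented sample.

        score(boundaries) returns the description length of the hypothesis in
        which `sample` is split at the given boundary positions; valid_points
        optionally restricts insertions to phonotactically licit positions
        (the -PHONO variants).  A sketch of the search described in the text,
        not the authors' code.
        """
        if valid_points is None:
            valid_points = set(range(1, len(sample)))   # every internal position
        boundaries = frozenset()                        # start with no segmentation
        best, best_score = boundaries, score(boundaries)

        while True:
            remaining = sorted(valid_points - boundaries)
            if not remaining:
                break
            # All hypotheses reachable by adding one or two segmentation points.
            candidates = [boundaries | {p} for p in remaining]
            candidates += [boundaries | {p, q} for p, q in combinations(remaining, 2)]
            boundaries = min(candidates, key=score)     # shortest hypothesis this step
            current = score(boundaries)
            if current < best_score:                    # remember the best ever found
                best, best_score = boundaries, current
        return sorted(best)

A RAND-style baseline would instead draw the required number of points at random from the valid set (e.g., with random.sample) and score the resulting hypothesis directly.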
</Section> </Section> </Paper>