<?xml version="1.0" standalone="yes"?>
<Paper uid="W94-0208">
  <Title>SEGMENTING SPEECH WITHOUT A LEXICON: THE ROLES OF PHONOTACTICS AND SPEECH SOURCE</Title>
  <Section position="3" start_page="0" end_page="85" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Ilifants lUllSt Icarni I,o recognize ccrtain sound seqllellCl'.s ;IS I)(~illg words; this is a dillicult i)rob-Icn~ I)ecausc norllial speech contains no obvious acoustic divisions between words. Two sources of hifornlation that liiighl, aid Sl)coch segnierltal,ion arc: disl.ribullion I,hC/ I)holienic s(;qilCli~;e in &lt;'.l alillC:u's frcqli(~jil,ly in scw'ral contcxl,s includillg Ihc~'al, cats allil ('fl/ll(li?J, wlicl'(~a~ I.li(~ s(~lillCll(~e in i'(ilu is rti.l'l, illllll alillCal's ill rl,stricl,ed Colll.l'xts; :ilill llli(ui~ll,:l,cl.i~'s cal is rill acl'(qil,al)h~ syllabl(~ in I'hilxJish, wh&lt;,l'Clis p&lt;'lll is not. While evidcnc(' (~x-isl,s I.li~d. infanl.s a, rc scnsitiv(' to I.hcsc ili\['ornlal;ioli s~ili'l:(.s, wl, kll(~w of iio Illea.SllrelllelltS of I, heir IISC/;-I'uhi~'ss. In this paper, we attempt to quantify the ils~flllli~ss OF distribution and phonotactics in seglil(,litillg spe(,ell. W(' found thai, each source provi(Icd Solue IlSCfllJ information for speech seginen-I,atioli, bill I, li(' colirbiiiation of sources provided subsl,anl,ial hiforliiation. Wc also fonnd that childdir~cl,('d Slmech was Uulch ea.~icr to soglnenl, than adult-directed speech when using both sources.</Paragraph>
    <Paragraph position="1"> 'Fo date, psychologists have focused on two aspects of the speech segmentation problem. The first is the problem of parsing continuous speech into words given a developed lexicon to which incoming sounds can be matched; both psychologists (e.g., Cutler &amp; Carter, 1987; Cutler &amp; Butterliel(I, 1992) and designers of speech-recognition systems (e.g., (\]hur(:h, 1987) have examined I~his problem. However, the problem we examined is dilferent---we want to know how infants segment speech before knowing which phonemic seqllelW,('s form words. '1'he second aspect psychologists liaw~ focnsed (ill is the lirobleln of dcternihiilig the ill\['Orluatioll SOllr(:(~s t() which ilifants are SCllSil,ive. Priluarily, I.wo sotircos haw~ ll(~ell (~x-ainine~l: prosody and word stress. II,enults suggest I.hal, parents (~xaggcrate prosody in child-directed speech to highlight iniportant words (Fernahl &amp; Mazzie, 1991; Aslin, Woodward, LaMendola &amp; Bever, in press) and that infants are sensitive to prosody (e.g., Hirsh-Pasek et al., 1987).</Paragraph>
    <Paragraph position="2"> Word stress in English fairly accurately predicts the location of word beginnings (Cutler &amp; Norris, 1988; Cutler &amp; Butterfield, 1992); Jusczyk, Cutler and II,edanz (1993) demonstrated that 9-monthohls (but not 6-month-olds) are sensitive to the common strong/weak word stress pattern in English. Sensitivity to native-language phonotactics in 9-month-olds was re(:ently reported by Jusczyk, I,'riedcrici, Wessels, Swmkerud and Jusczyk (1993).</Paragraph>
    <Paragraph position="3"> 'i'lles~ sl, udi(~s deruoilstratcd infants' perceptive abilil.il's wil,hout deiilonsl.ral,hig tlw usefuhicss of hli'alil,s &gt; ll(~rcel)l,ioils.</Paragraph>
    <Paragraph position="4"> I low do childl'(,n coiubine l,li(: iiiforiii;d,ion I, hey i)crc~,iw; froln dilrerenl, SOlll'l;es'. ? Aslili el, al. Sl)(~c-Illate that infants first learn words heard in isolation, then use distribution and prosody to refine and expand their w)cabulary; however, Jusczyk (1(,)93) sliggests that sound sequences learned in isolation dill~r too greatly from those in contexi.</Paragraph>
    <Paragraph position="5"> to bc useful. He goes on to say, &amp;quot;just how far inforniation in the sound structure of the input can  pies, we see that Hypothesis 1 uses 48 characters and Hypothesis 2 uses 75. However, this simplistic method is inefficient; for instance, the length of lexical indices are arbitrary with respect to properties of the words themselves (e.g., in Hypothesis 2, there is no reason why/jul/was assigned tile index '10'--length two--instead of '9'--length one).</Paragraph>
    <Paragraph position="6"> Our system improves upon this simple size metri(: I)y coml)uting sizes based on ;t ('Onll)act rel)rcs(,ntat.ion motivated I)y informati(m theory.</Paragraph>
    <Paragraph position="7"> W(: inmginc hypothes(:s r(qu'(~sented ;~ a string of ones and zeros. This binary string must r(,present not only the lexical entries, their indices (called code words) and the coded sample, but also overhead information specifying the number of items coded and their arrangement in the string (information implicitly given by spacing and sl)atial placement in the introductory cxamples). Furtherrnore, the string and its components must be self-delimiting, so that a decoder could identify the endpoints of components by itself. The next section describes the binary representation and the length formulm derived from it in detail; readers satisfied with the intuitive descriptions presented so far should skip ahead to the Phonotactics subsection. null Representation and Length Formulae The representation scheme described below ix I);~scd on information theory (for more examples of coding systems, see, e.g., Li L: VitKnyi, 1993 and Quinlan &amp; Rivest, 1989). From this representation, we can derive a formula describing its length in bits. However, the discrete form of the formula would not work well in practice for our simulations. Instead, we use a continuous approximation of the discrete formula; this approximation typically involves dropping the ceiling function from length computations. For example, we sometimes use a self-delimiting representation for integers (as described in Li &amp; VitS.nyi, pp. 74-75).</Paragraph>
    <Paragraph position="8"> In this representation, the number of bits needed to code an integer x is given by</Paragraph>
    <Paragraph position="10"> lIowever, we use the following approximation:</Paragraph>
    <Paragraph position="12"> Using the discrete formula, the dilference I)etwc(,n g(21(126) and g(2)(127) is zero, while the difference between e(~)(127) and g(21(128) is one bit; using the continuous formula, the difference between ~(~)(126) and g(2)(127) is 0.0156, while the differ(m(:e I)ct.wecn g~)( 1271 and g(2)(128) is 0.0155. We f(mn(I it easier to inl.m'l)ret tim results using a cont.imu)us fun(:ti(m, s,) in t,lw J'~dh)witlg(liscussion, w(' will only i)r('s(.ut. I.h(. a.i)l)roxim;d.(~ fi~rmuh,~.</Paragraph>
    <Paragraph position="13"> 'rite lexicon lists words (represented as phoneme sequences) paired With I,Imir code words 1 . For (,xample:  Ill the hhm, ry relu'esentation , the two rohmms a, re represented separately, one ;ffter the other; tim first column is called the word inventory column; the second column is called the code word inventory column.</Paragraph>
    <Paragraph position="14"> In the word inventory colunul (see Figure la for a schematic), the list of lexical items is rel)r('sented as a continuous string of i)honemes, without separators between words (e.g., ~;)kmtkltisi...).</Paragraph>
    <Paragraph position="15"> To mark tile boundaries between lexical items, the phoneme string is preceded by a list of integers representing the lengths (in phonemes) of each word. Each length is represented am a. lixcd-length, zero-padded binary number, l'rceeding this list is a single integer denoting the length of each length field; this integer is represented in unary, so that its length need not be known in adwmce. Preceding the entire column is the numl)er of h,xica.I entries n codc(I as a self-dclimiting integer.</Paragraph>
    <Paragraph position="16"> The length of the representation of I.he integer n is given by the fimction</Paragraph>
    <Paragraph position="18"> We define len(wi) to be the mmlber of phonemes ill word wi. If there are p total unique phonemes used in tile sample, titan wc represent each phoneme as a fixed-length bit string of length len(p) = log 2 p. So, the length of the representation of a word wi in tile lexicon is the mnnber of phonemes in the word times the length of a phoneme: len(p), len(wi). The total length of all the words in the lexicon is tile sum of this formula over all lexical items:</Paragraph>
    <Paragraph position="20"> As stated al)ovc, the length liehls used to divide the phoneme string are lixe(Mcugth, lu e;u'h field is an integer I)etween one an(I the munl)er of phonemes in the longest word. Since repres(mtitlg integers between one and x takes log2 x bits, tim length of each field is: tog~(,;!?,~ t.,(,,,,)) I( ',ode words ;tl'e I'elWeSelfl,ed I)y Sqllitl'4, br;wi(qq,s, so \[:v\] means %he (:ode won'd coro'eSl)C,lldintl4 I,o :r'. T I.)otsl.ral) I.hc acquisition of other levels \[of linguisl.ic organization\] remaius to be determined.&amp;quot; In this paper, we measure the potential roles of dis-I,ribution, phonotactics and their combination using a computer-sitnulated learning algorithm; the simulation is based on a bootstrapping model in which phonotactic knowledge is used to constrain the distributional analysis of speech samples.</Paragraph>
    <Paragraph position="21"> While our work is in part motivated by the above research, other developmental research supports certain ;assumptions we make. The input to our system is represented as a sequence of i)houenms, so we implicitly assume that infants are aisle I.o ,'ouv('rl, from acoustic inl)ut to phoneme sequem:es; research i)y Kuhl (e.g., Gricser &amp; Kuhl, 1989) suggests tha.t this assmnl)tion is remsonal)h,. Since sentence I)oundaries provide informal.ion ahout word I)oumlaries (the end of a sentence is also the end of a word), our input contains sentence I~oumhu'ik~s; several studies (13ernstein-II.atm'r, 1985; Ilirsh-lh~sek et al., 1987; Kemler Nels~m, I lirsh-I'asek, ,lusczyk &amp; Wright C;msidy, 1989; ,I usczyk et al., 1992) have shown that infimts can perceive senl,cncc I)oundarics using prosodic cues. Ih)wever, FiSher and 'lbkura (in press) found m) evidence that prosody can accurately predict word boundaries, .so the task of finding words remains. Finally, one might question whether in-Ikmts have the ability we are trying to model--that is, whether they can identify words embedded in sentences; Jusczyk and Aslin (submitted) found that 7 I/2-month-olds can do so.</Paragraph>
    <Paragraph position="22"> The Model To gain an intuitive understanding of our model, consider the f()llowing speech sample (transcripti{,u is in IPA): Orthogral)hy: I)o you see tim kitty? Se(' the kitty? I)o you like the kil,t,y'.~ Trauscril)l,ioil: (luj usiiS;~kl ti si(ioklti du.iulalk&amp;)klti There are many differeut ways to break this sampie into I)utative words (each particular segmenl, ation is called a segmentation hypothesis). Two sucll hypotheses a~re:  Segmentation 1: du ju si 59 klti si 5~ klti du ju lalk 5,3 klti S&lt;:gnmnfl,ation 2: duj us i5 ~)klt i sic) ~)k nti (lu jul alk ()ok Iti lasting I, he wor(Is used I)y each segmentation hyi,othcsis yMds the Ibllowing two lexicons:  Note that Segmentation 1, the correct hypothesis, yields a compact lexicon of frequent words whereas Segmentation 2 yields a much larger lexicon of infrequent words. Also note that a lexicon contains only the words used in the sample--no words are known I.o tim system a priori, nor are any carried ow;r from one hypothesis to the next. Given a lexicon, tim saml)le can I)e encoded by ret)lacing words with their respective indices into the lexicon:  Encoded Sample l: 1, 6, 5, 2, 3; 5, 2, 3; l, 6, 4, 2, 3; Encoded Saml)le 2: 2, 11_2, 6, 4, 5; 11, 3, 8; 1, 9, 10, 7, 8;  Our simulation attempts to find the hypothesis that minimizes the combined sizes of the lexicon and encoded sample. This approach is called the Minimum Description Length (MDL) paradigm and has been used recently in other domains to analyze distributional information (Li &amp; Vitgnyi, 1993; Rissanen, 1978; Ellison, 1992, 1994; Brent, 1993). For reasons explained in the next section, the system converts these character-based representations to compact binary representations, using the number of bits in the binary string as a Ine~u re of size.</Paragraph>
    <Paragraph position="23"> I)imnotac(.ic rules can I)e used to restrict tim s(wnenl,al,ion hyl)ol, hesis Sl)ace I)y preventing word I)ountlari(,s a.t certain places; for instance, /ka,l,sp:)z/ (&amp;quot;,:at's paws&amp;quot;) has six i,,ternal s(~gmental.ion I)oints (k ;~l,Sl):)z, ka: t.sl):)z, el.c), only two of which are I)honotactically allowed (ka:t Sl):)z and kmts 1)3z). '17o evaluate the usefuhmss of phonotactic knowledge, we compared results between phonotactically constrained and unconstrained simulations.</Paragraph>
  </Section>
class="xml-element"></Paper>