<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1009">
  <Title>Evolution of a Rapidly Learned Representation for Speech</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> All results shown are from the best network evolved (fitness=0.45) after it had been trained on 30 English sentences corresponding to about 2 minutes of continuous speech. Figure 4 shows the response of this network to one of the TIMIT testing sentences.</Paragraph>
    <Paragraph position="1"> From the response of the feature units to speech sounds (see Figure 4) it was clear that some units were switched off by fricatives, and some units were switched on by voicing, so both excitation and inhibition play an important part in the functioning of the feature detectors. The feature unit responses did not seem to correlate directly with any other standard acoustic features (e.g. nasal, compact, grave, flat etc.). An analysis of the frequency response of the eight feature detectors (see Figure 5) showed that each unit had excitatory projections from several frequency bands. Generally, the frequency responses were mutually exclusive so that each unit responded to slightly different sounds, as one would expect.</Paragraph>
    <Paragraph position="2">  feature units to pure tones. Feature units 2 and 3 receive strong excitatory inputs from low frequencies (below 4 kHz) and are therefore activated by voicing.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Cross-Linguistic Performance
</SectionTitle>
      <Paragraph position="0"> In order to determine the cross-linguistic performance of the &amp;quot;innate&amp;quot; features evolved on English speech, sound files of the news in several languages were obtained from the Voice of America FTP site (ftp.voa.gov). Since phonological transcription files were not available for these files they could not be used to test the network, because the times of the phoneme mid-points were unknown. All the VOA broadcast languages 2 were used as training files, and the network was tested on 30 American English sentences found in the TIMIT speech files. The timecourses of development for four languages are shown in Figure 6. Maximum fitness was reached after training on any language for roughly 20 sentences (each lasting about 3 seconds).</Paragraph>
      <Paragraph position="1"> All of the human languages tested seemed to be equally effective for training the network to represent English speech sounds. To see whether any sounds could be used for training, the network was trained on white noise. This resulted in slower learning and a lower fitness. The fitness for a network trained on white noise never reached that of the same network trained on human speech. An even worse impediment to learning was to train on low-pass filtered human speech.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Categorical Perception
</SectionTitle>
      <Paragraph position="0"> Categorical perception of some phonemes is a robust phenomenon observed in both infants and  file test/dr3/fkms0/sxl40). Input units are fed with sound spectra and activate the feature units. Activity is shown as a greyscale (maximum activity is portrayed as black) with time on the horizontal axis. Phone and word start and end times as listed in TIMIT are shown in the bottom two panels. This is the same network as shown in Figure 5.</Paragraph>
      <Paragraph position="1"> 0 5 10 15 20 25 30 Nuntber d Traildng Sentences  nal value after presentation of just 20 sentences regardless of the language used to train the network.</Paragraph>
      <Paragraph position="2"> The six curves show the learning curves for a network tested on 30 sentences of English having been trained on English, Cantonese, Swahili, Farsi, white noise and low-pass filtered English.</Paragraph>
      <Paragraph position="3"> adults. We tested the network on a speech continuum ranging between two phonemes and calculated the change in the representation of the speech tokens along this continuum. Note that this model simply creates a representation of speech on which identification judgements are based. It does not identify phonemes itself. All that the model can provide is distances between its internal representations of different sounds. Categorical perception can be exhibited by this network if the internal representation exhibits non-linear shifts with gradual changes in the input i.e. a small change in the input spectrum can cause a large change in the activity of the output units.</Paragraph>
      <Paragraph position="4"> Using a pair of real /~/ and /s/ spectra from a male speaker, a series of eleven spectra were created which formed a linear continuum from a pure/.f/to a pure/s/. This was done by linearly interpolating between the two spectra, so the second spectrum in the continuum was a linear sum of 0.9 times the /.f/ spectrum plus 0.1 times the/s/spectrum. The next spectrum was a linear sum of 0.8 times the/.f/ spectrum plus 0.2 times the/s/spectrum, and so on for all nine intermediate spectra up to the pure/s/.</Paragraph>
      <Paragraph position="5"> Each of the eleven spectra in the continuum were individually fed into the input of a network that had been trained on 30 sentences of continuous speech in Nakisa ~ Plunkett 76 Evolution of Speech Representations English. The output feature responses were stored for each spectrum in the continuum. The distances of these feature vectors from the pure/J~/and pure /s/are shown in Figure 7.</Paragraph>
      <Paragraph position="6">  - /s/ continuum. Circles show the distance from a pure/J~/and triangles show the distance from a pure /s/. Clearly, the distance of the pure/j~/from itself is zero, but moving along the continuum, the distance from the pure/.~/increases steadily until it reaches a maximum for the pure/s/ (distances were scaled such that the maximum distance was 1). Figure 7 shows that the representation is non-linear. That is, linear variations in the input spectrum do not result in linear changes in the activity of the feature units. Compared to the spectral representation of the /J~/- /s/ continuum, the network re-represents the distances in the following ways: * There is a discontinuity in the distances which occurs closer to the/J~/than the/s/.</Paragraph>
      <Paragraph position="7"> * The distance from the representation of a pure /s/remains small for spectra that are a third of the way toward the pure/.~/.</Paragraph>
      <Paragraph position="8"> A classifier system using this representation would therefore shift the boundary between the two phonemes toward If~ and be relatively insensitive to spectral variations that occurred away from this boundary. These are the hallmarks of categorical perception.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Similarity Structure of the
Representation
</SectionTitle>
      <Paragraph position="0"> A consequence of any representation is its effect on similarity judgements. Miller and Nicely (1955) used this fact in an elegant experiment designed to infer the manner in which humans identify sixteen English consonants. They asked subjects to identify CV pairs where the consonant was one of the sixteen being tested and the vowel was/aI/, as in father. By adding noise to the stimuli at a constant loudness and varying the loudness of the speech they could control the signal to noise ratio of the stimuli and measure the number and type of errors produced. Subjects produced a consistent pattern of errors in which certain pairs of consonants were more confusable than others. For example, the following pairs were highly confusable: m-n, f-0, v-~, p-t-k, d-g, s-f, z- 5. When clustered according to confusability the consonants formed three groups: voiceless, voiced and nasal consonants. Confusability was greatest within each group and smallest between groups.</Paragraph>
      <Paragraph position="1"> Since our model did not classify phonemes it was not possible to create a phoneme confusability matrix using the same method as Miller and Nicely.</Paragraph>
      <Paragraph position="2"> However, it was possible to create a clustering diagram showing the similarity structure of the representations for each phoneme. If given noisy input, phonemes whose representations are closest together in the output space will be more easily confused than phonemes that lie far apart. Since a cluster analysis of many thousands of phoneme tokens would not be clear, a centroid for each phoneme type was used as the input to the cluster analysis. Centroids were calculated by storing the input and output representations of phonemes in 1000 TIMIT sentences. Cluster analyses for the spectral input representation and the featural output representation are shown in Figure 8. 3 From Figure 8 it is clear that the featural output representation broadly preserves the similarity structure of the spectral input representation despite the eight-fold compression in the number of units.</Paragraph>
      <Paragraph position="3"> In both the input and output representations the phonemes can be divided into three classes: fricatives/affricates, vowels/semi-vowels, and other consonants. Some phonemes are shifted between these broad categories in the output representation, e.g.</Paragraph>
      <Paragraph position="4"> t, 0 and f are moved into the fricative/affricate category. The reason for this shift is that t occurs with 3It should be noted that for stops, TIMIT transcribes closures separately from releases, so /p/ would be transcribed /pcl p/. The results shown here are for the releases, hence their similarity to fricatives and affricates. Nakisa ~ Plunkett 77 Evolution of Speech Representations a high token frequency, so by pulling it apart from other frequently occurring, spectrally similar consonants, the fitness is increased.</Paragraph>
      <Paragraph position="5"> Both spectral and featural representations showed a high confusability for m-n, f-0, d-g, s-J ~, as found in the Miller and Nicely experiments. There were discrepancies, however: the stops p-t-k were not particularly similar in either the input or output representations due to an artifact of the representations being snapshots at the mid-points of the stop release.</Paragraph>
      <Paragraph position="6"> In human categorisation experiments, phonemes are judged on the basis of both the closure and the release, which would greatly increase the similarity of the stops relative to other phonemes. In the input representation, v-6 are fairly close together, but are pulled apart in the output representation. Both these phonemes had low token frequencies, so this difference may not be a result of random variation.</Paragraph>
      <Paragraph position="7"> In Figure 8 3 is not shown because it occurred very infrequently, but the centroids of z- 3 were very close together, as found by Miller and Nicely.</Paragraph>
      <Paragraph position="8">  featural representations. Labels are TIMIT ASCII phonemic codes: dx-r, q-?, jh-d3, ch-~, zh-3, th-0, dh-0, em-rn, en-~, eng-~, nx-r, hh-h, hv-~, el!, iy-il, ih-I, eh-e, ey-ej, aa-ct, ay-aj, ah-A, ao-o, oy-3j, uh-u, uw-m, ux-u, er- U ax-o, ix-i, axr-~, ax-h-o.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>