<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1009">
  <Title>Evolution of a Rapidly Learned Representation for Speech</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Overview of the Model
</SectionTitle>
    <Paragraph position="0"> The goal of the model is to create a neural network that takes speech spectra as input and develops the same representation of speech whatever language it is exposed to. Furthermore, we avoid hard-wiring the connections in the network. Rather, the network employs a set of unsupervised learning rules that converge on the same representation whatever the initial set of connection strengths between neurons in the network. It is important that the learning is unsupervised, as the developing infant has no teaching signal as to the contrasts present in speech.</Paragraph>
    <Paragraph position="1"> In essence this model of early speech perception embodies Waddington's (1975) principle of epigenesis, or what Elman et al. (1996) have more recently described as architectural/computational innateness.</Paragraph>
    <Paragraph position="2"> The approach we have taken is to encode the properties of neural networks in a genome and to evolve, by a process called a genetic algorithm, a population of neural networks that respond in the appropriate way to speech spectra. Initially, a population of 50 genomes is randomly generated. Each of these networks is presented with speech spectra and we quantify how well its neuronal engram of speech encodes the incoming signal. This number is called the &amp;quot;fitness&amp;quot; of a network. For the task of representing speech sounds we want a network that is responsive to the salient aspects of the speech signal, in particular those necessary for identification of speech segments. A network that is good at representing speech will encode tokens of the same acoustic segment as similarly as possible and different segments as differently as possible.</Paragraph>
    <Paragraph position="3"> The initial population performs very poorly on the task, but some networks perform better than others.</Paragraph>
    <Paragraph position="4"> Two parents are randomly selected from the population with a probability that increases with increasing fitness. The parental genomes are spliced together to form one child network that is then tested to find its fitness. The child network then replaces the network that has the lowest fitness in the population. Each gene also has a small chance of mutating to a new value after sexual reproduction, so that new genes are constantly entering the collective gene pool; otherwise the evolutionary process would simply be a re-shuffling of genes present in the initial population. The process of parental selection, sexual reproduction, mutation of the offspring and evaluation of the offspring is repeated for several thousand generations. Genes that are useful for the task at hand, as specified by the fitness function, increase in frequency in the population, while genes that are not useful decline in frequency. Within a few hundred generations the networks in the population develop representations that have a high fitness value, as illustrated in Figure 1.</Paragraph>
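    <Paragraph> The selection loop just described can be sketched as follows. This is a minimal illustration rather than the implementation used here: make_genome and evaluate_fitness are placeholder stand-ins, since the real fitness is obtained by training a network and scoring its representation of speech.

```python
import random

# Minimal sketch of the steady-state genetic algorithm described above.
# make_genome and evaluate_fitness are placeholders: the real model
# builds a network from the genome and scores its speech representation.

def make_genome(length=20):
    return [random.randint(-2, 2) for _ in range(length)]

def evaluate_fitness(genome):
    return sum(abs(g) for g in genome)  # dummy score, not the real fitness

def crossover(mum, dad):
    point = random.randrange(1, len(mum))   # single-point recombination
    return mum[:point] + dad[point:]

def mutate(genome, rate=0.01):
    # Each gene has a small chance of mutating to a new value.
    return [random.randint(-2, 2) if random.random() < rate else g
            for g in genome]

def evolve(generations=1000, pop_size=50):
    population = [make_genome() for _ in range(pop_size)]
    fitness = [evaluate_fitness(g) for g in population]
    for _ in range(generations):    # one generation = one new network
        # Parents are picked with probability increasing with fitness.
        mum, dad = random.choices(population, weights=fitness, k=2)
        child = mutate(crossover(mum, dad))
        # The child replaces the least-fit member of the population.
        worst = fitness.index(min(fitness))
        population[worst] = child
        fitness[worst] = evaluate_fitness(child)
    return max(fitness)
```

Because the child always replaces the worst network, good genes can only accumulate, which is what drives the rapid rise in population fitness.</Paragraph>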
    <Paragraph position="5"> Figure 1: Fitness of the population with increasing number of generations, where a generation is defined as the production of one new network. Initially networks perform very poorly, but selection improves the population rapidly.</Paragraph>
    <Paragraph position="6"> Clearly, the encoding scheme used to store the properties of neural networks critically affects how well the networks may perform on any given task.</Paragraph>
    <Paragraph position="7"> The encoding scheme we have chosen is very flexible, storing information about the architecture of a network and its learning properties. Architecture defines which neurons may be connected to other neurons, and this presupposes some way of grouping neurons such that these gross patterns of connectivity can be defined. For the purposes of defining network architecture, therefore, the network is sub-divided into subnetworks. The genome specifies how many subnetworks there are, how many neurons are in each subnetwork, which subnetworks are connected to one another, and, given that two subnetworks are connected, which learning rule is used in connections between neurons in those subnetworks.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Description of the Model
</SectionTitle>
    <Paragraph position="0"> The model builds on previous connectionist models, particularly the broad class of models known as interactive activation with competition (IAC) models (see Grossberg (1978) for review). An IAC network consists of a collection of processing units divided into several competitive pools. Within pools there are inhibitory connections and between pools there are excitatory connections. Connections are interactive because one pool interacts with other pools and in turn is affected by those pools. Because of these interactions the activity of units in IAC networks develops over time, sometimes settling into steady patterns of activation. Inhibitory connections within a pool mean that one unit at a time dominates the others in a winner-take-all fashion. The TRACE model of speech perception is possibly the most successful and best known example of such models (McClelland &amp; Elman, 1986).</Paragraph>
    <Paragraph position="1"> Although similar to IAC networks, the models described here have three major modifications: Learning Each network learns using many different, unsupervised learning rules. These use only local information, and so are biologically plausible. Flexible Architecture Every network is split into a number of separate subnetworks. This allows exploration of different neuronal architectures, and it becomes possible to use different learning rules to connect subnetworks. Subnetworks differ in their &amp;quot;time-constants&amp;quot;, i.e. they respond to information over different time-scales.</Paragraph>
    <Paragraph position="2"> Genetic Selection Networks are evolved using a technique called genetic connectionism (Chalmers, 1990). Using a genetic algorithm allows great flexibility in the type of neural network that can be used. All the attributes of the neural network can be simultaneously optimised, rather than just the connections. In this model the architecture, learning rules and time-constants are all optimised together.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Genome Design and Sexual
Reproduction
</SectionTitle>
      <Paragraph position="0"> The genome has been designed to have two chromosomes stored as arrays of numbers. One chromosome stores the attributes of each subnetwork, such as the number of units in the subnetwork, the subnetwork time constant and the indices of the other subnetworks to which the subnetwork projects. The other chromosome stores learning rules which are used to modify connections between individual units.</Paragraph>
      <Paragraph position="1"> During sexual reproduction of two networks the two chromosomes from each parent are independently recombined. In recombination, a point within a chromosome array is randomly chosen, and all the information up to that point is copied from the paternal chromosome and the rest of the chromosome is copied from the maternal chromosome, creating a hybrid chromosome with information from both parents. Clearly, the subnetwork and learning rule chromosomes must be the same length for sexual recombination to occur, so not all pairs of parents can reproduce. Parents must be sexually compatible, i.e. they must have the same number of subnetworks and learning rules.</Paragraph>
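      <Paragraph> The independent recombination of the two chromosomes can be sketched as below, with chromosomes shown as plain Python lists; the field layout is illustrative, not the paper's exact encoding.

```python
import random

# Sketch of sexual reproduction: the subnetwork chromosome and the
# learning-rule chromosome of each parent are recombined independently
# at a randomly chosen point.

def recombine(paternal, maternal):
    if len(paternal) != len(maternal):
        raise ValueError("parents are not sexually compatible")
    point = random.randrange(1, len(paternal))
    # Information up to the crossover point comes from the paternal
    # chromosome, the remainder from the maternal one.
    return paternal[:point] + maternal[point:]

def reproduce(parent_a, parent_b):
    return {
        "subnets": recombine(parent_a["subnets"], parent_b["subnets"]),
        "rules": recombine(parent_a["rules"], parent_b["rules"]),
    }
```

The length check enforces the compatibility requirement: only parents with the same number of subnetworks and learning rules can produce a child.</Paragraph>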
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Dynamics
</SectionTitle>
      <Paragraph position="0"> The dynamics of all units in the network are governed by the first order equation \tau_n \frac{da_i^n}{dt} = \sum_{s,j} w_{ij}^{sn} a_j^s - a_i^n (1) where \tau_n is the time constant for subnetwork n, a_j^s is the activity of the jth unit in subnetwork s, a_i^n is the activity of the ith unit in subnetwork n, and w_{ij}^{sn} is the synaptic strength between the jth unit in subnetwork s and the ith unit in subnetwork n. In other words, the rate of change in the activation of a unit is a weighted sum of the activity of the units which are connected to unit i, minus a decay term. If there is no input to the unit its activity dies away exponentially with time constant \tau_n. The activity of a unit is steady when the activity of the unit is equal to its net input. Activities were constrained to lie in the range 0.0 &lt; a &lt; 1.0. Network activity for all the units was updated in a synchronous fashion with a fixed time-step of 10 ms using a fourth order Runge-Kutta integration scheme adapted from Numerical Recipes (Press, Flannery, Teukolsky, &amp; Vetterling, 1988).</Paragraph>
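      <Paragraph> A single integration step of Equation 1 for one subnetwork can be sketched as follows, using a simple Euler step for clarity instead of the fourth-order Runge-Kutta scheme actually used.

```python
import numpy as np

# Sketch of the unit dynamics of Equation 1 for one subnetwork, using a
# plain Euler step (the paper uses fourth-order Runge-Kutta).

def step_activity(a, inputs, w, tau, dt=0.01):
    """One integration step for a subnetwork with time constant tau.

    a      -- current activities of the subnetwork's units
    inputs -- activities of all units projecting into the subnetwork
    w      -- weight matrix, shape (len(a), len(inputs))
    """
    da_dt = (w @ inputs - a) / tau      # weighted input minus decay
    a_new = a + dt * da_dt
    return np.clip(a_new, 0.0, 1.0)    # activities constrained to [0, 1]
```

With zero input the activity decays exponentially toward zero, and with steady input it settles where activity equals net input, matching the behaviour described above.</Paragraph>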
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Architecture
</SectionTitle>
      <Paragraph position="0"> Architecture defines the gross pattern of connectivity between groups of units. The architecture has to be stored in a &amp;quot;genome&amp;quot; to allow it to evolve with a genetic algorithm, and one very flexible method of encoding the architecture is to create a subnetwork connectivity matrix. If there are n subnetworks in the network, then the subnetwork connectivity matrix will be an n by n matrix. The column number indicates the subnetwork from which connections project, and the row number indicates the subnetwork to which connections project.</Paragraph>
      <Paragraph position="1"> Complex architectures can be represented using a subnetwork connectivity matrix. The matrix allows diagonal elements to be non-zero, so a subnetwork can be fully connected to itself. In addition, the subnetwork connectivity matrix is used to determine which learning rule will be used for the connections between any pair of subnetworks. If an element is zero there are no connections between the two subnetworks. A positive integer element indicates that the subnetworks are fully connected, and the value of the integer specifies which one of the many learning rules to use for that set of connections. A simple architecture is shown in Figure 2 alongside its corresponding subnetwork connectivity matrix.</Paragraph>
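      <Paragraph> A made-up three-subnetwork example of how such a matrix is read (the values here are illustrative, not those of Figure 2):

```python
import numpy as np

# C[target, source] is 0 for no connections, or a positive integer naming
# the learning rule used for the full set of connections from source to
# target. This example matrix is invented for illustration.

C = np.array([
    [0, 0, 0],   # subnetwork 0 (input) receives no connections
    [1, 2, 0],   # subnetwork 1: from 0 via rule 1, from itself via rule 2
    [0, 3, 0],   # subnetwork 2: from 1 via rule 3
])

def connections(C):
    """List (source, target, rule) for every connected subnetwork pair."""
    return [(src, tgt, int(C[tgt, src]))
            for tgt in range(C.shape[0])
            for src in range(C.shape[1])
            if C[tgt, src] != 0]
```

The non-zero diagonal element C[1, 1] is what makes subnetwork 1 fully connected to itself, in the same way that C88 = 1 does for subnetwork 8 in Figure 2.</Paragraph>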
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Learning Rules
</SectionTitle>
      <Paragraph position="0"> Learning rules are of the general form shown in Equation 2. They are stored in the network genome in groups of seven coefficients k_0 to k_6 following the equation \Delta w_{ij} = l (k_0 + k_1 a_i + k_2 a_j + k_3 a_i a_j + k_4 w_{ij} + k_5 a_i w_{ij} + k_6 a_j w_{ij}) (2)</Paragraph>
      <Paragraph position="2"> In Equation 2, \Delta w_{ij} is the change in synaptic strength between units j and i, l is the learning rate, a_i is the activity of unit i, a_j is the activity of unit j and w_{ij} is the current synaptic strength between units j and i. The learning rate l is used to scale weight changes to small values for each time step to avoid undesirably rapid weight changes. The coefficients in this equation determine which learning rule is used. For example, a Hebbian learning rule would be represented in this scheme with k_3 &gt; 0, k_0 &lt; 0 and k_1 = k_2 = k_4 = k_5 = k_6 = 0. Connections between units using this learning rule would be strengthened if both units were simultaneously active. A network has several learning rules in its genome stored as a set of these coefficients. Weight values are clipped to avoid extremely large values developing over long training periods. The range used was -1.0 &lt; w_{ij} &lt; +1.0.</Paragraph>
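      <Paragraph> This rule family can be sketched as follows. The ordering of the seven terms is an assumption, chosen to be consistent with the Hebbian case above (k3 on the product of activities, k0 a constant):

```python
# Sketch of the seven-coefficient learning rule family of Equation 2.
# The term ordering is assumed, matching the Hebbian example in the text.

def delta_w(k, a_i, a_j, w_ij, l=0.01):
    """Weight change for one connection; k holds coefficients k0..k6."""
    k0, k1, k2, k3, k4, k5, k6 = k
    return l * (k0 + k1 * a_i + k2 * a_j + k3 * a_i * a_j
                + k4 * w_ij + k5 * a_i * w_ij + k6 * a_j * w_ij)

def apply_rule(k, a_i, a_j, w_ij, l=0.01):
    # Weights are clipped to [-1, 1] to stop them growing without bound.
    return max(-1.0, min(1.0, w_ij + delta_w(k, a_i, a_j, w_ij, l)))

# A Hebbian rule: k3 > 0 and k0 < 0, all other coefficients zero.
hebb = (-1, 0, 0, 2, 0, 0, 0)
```

With this rule the weight grows only when both units are active together and otherwise slowly decays, as described above.</Paragraph>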
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Training and Evaluation of Fitness
</SectionTitle>
      <Paragraph position="0"> Networks were trained and evaluated using digitised speech files taken from the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT), as described in Garofolo et al. (1990).</Paragraph>
      <Paragraph position="1"> All networks were constrained to have 64 input units because speech sounds were represented as power spectra with 64 values. This was an artificial constraint imposed by the format of the spectra.</Paragraph>
      <Paragraph position="2"> The power spectra were calculated with the OGI speech tools program MAKEDFT 1 (modified to produce the correct output format) with a window size of 10 ms and with successive windows adjacent to one another. For these simulations 8 output subnetworks were used to represent features because this is roughly the number claimed to be necessary for distinguishing all human speech sounds by some phoneticians (Jakobson &amp; Waugh, 1979).</Paragraph>
      <Paragraph position="4"> Figure 2: A simple architecture and its corresponding subnetwork connectivity matrix. Subnetworks 1 and 2 are the input and output subnetworks, respectively. Arrows represent sets of connections and the type of learning rule employed by those sets of connections. Three learning rules are used: solid arrows (learning rule 1), dashed arrows (learning rule 2) and dotted arrows (learning rule 3). Some subnetworks are fully connected to themselves, such as subnetwork 8 (since C88 = 1), while others are information way-stations, such as subnetwork 5 (C55 = 0).</Paragraph>
      <Paragraph position="6"> All the connections, both within and between subnetworks, were initialised with random weights in the range -1.0 to +1.0. Networks were then exposed to a fixed number of different, randomly selected training sentences (usually 30). On each time-step activity was propagated through the network of sub-networks to produce a response activity on the output units. All connections were then modified according to the learning rules specified in the genome.</Paragraph>
      <Paragraph position="7"> On the next time-step a new input pattern corresponding to the next time-slice of the speech signal was presented and the process of activity propagation and weight modification repeated. The process of integrating activities and weight updates was repeated until the network had worked its way through all the time-slices of each sentence.</Paragraph>
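      <Paragraph> The training procedure can be sketched in miniature as below. This reduces the model to a single weight matrix from the 64 input units to 8 output units with one assumed Hebbian-with-decay rule, rather than the evolved multi-subnetwork architecture with genome-specified rules.

```python
import numpy as np

# Miniature of the training loop: on each 10 ms time-slice, activity is
# propagated and all weights are modified, then the weights are returned
# frozen for the testing phase. The single Hebbian-with-decay update is
# an assumption standing in for the genome-specified rules.

rng = np.random.default_rng(0)

def train(sentences, n_in=64, n_out=8, l=0.01):
    # All connections initialised with random weights in [-1, 1].
    w = rng.uniform(-1.0, 1.0, size=(n_out, n_in))
    for sentence in sentences:        # each sentence: (T, 64) spectra
        for spectrum in sentence:     # one 64-value slice per 10 ms step
            out = np.clip(w @ spectrum, 0.0, 1.0)   # propagate activity
            # Assumed Hebbian update with a small constant decay.
            w += l * (np.outer(out, spectrum) - 0.1)
            w = np.clip(w, -1.0, 1.0)  # weights clipped as in Section 3.4
    return w                           # weights frozen for testing
```
</Paragraph>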
      <Paragraph position="8"> In the testing phase activation was propagated through the network without weight changes. The weights were frozen at the values they attained at the end of the training phase. Testing sentences were always different from training sentences. When a time-slice corresponded with the mid-point of a phoneme, as defined in the TIMIT phonological transcription file, the output unit activities were stored alongside the correct identity of the phoneme.</Paragraph>
      <Paragraph position="9"> Network fitness was calculated using the stored output unit activities after the network had been exposed to all the testing sentences. The fitness function f was f = \sum_{i=1}^{N} \sum_{j=1}^{N} s_{ij} \, dist(\bar{a}_i, \bar{a}_j) (3)</Paragraph>
      <Paragraph position="11"> Here s_{ij} = +1 if i and j are different phonemes and s_{ij} = -1 if i and j are identical phonemes, \bar{a}_i and \bar{a}_j were the output unit activity vectors at the mid-points of all N phonemes and dist was the Euclidean distance. This fitness function favoured networks that represented occurrences of the same phoneme as similarly as possible and different phonemes as differently as possible. A perfect network would have all instances of a given phoneme type mapping onto the same point in the output unit space and different phonemes as far apart as possible. Note that constant output unit activities would result in a fitness of 0.0. An ideal learning rule would be able to find an appropriate set of weights whatever the initial starting point in weight space. Each network was trained and tested three times from completely different random initial weights on completely different sentences. This reduced random fitness variations caused by the varying difficulty of training/testing sentences and the choice of initial weights.</Paragraph>
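      <Paragraph> The fitness computation can be sketched as below, with vectors holding the stored mid-point output activities and labels the corresponding phoneme identities:

```python
import numpy as np

# Sketch of the fitness function: pairwise Euclidean distances between
# stored output vectors, counted positively when the two mid-point
# vectors belong to different phonemes and negatively when they belong
# to the same phoneme.

def fitness(vectors, labels):
    """vectors: (N, d) output activities at phoneme mid-points;
    labels: the N phoneme identities."""
    f = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(n):
            s = -1.0 if labels[i] == labels[j] else 1.0
            f += s * np.linalg.norm(vectors[i] - vectors[j])
    return f
```

Note that same-phoneme pairs contribute zero when their vectors coincide, so a network with constant output activities scores exactly 0.0, as stated above.</Paragraph>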
      <Paragraph position="12"> Evolution was carried out with a population of 50 networks. Genomes were initially generated with certain limits on the variables. All genomes had 16 input subnetworks and 8 output subnetworks with time constants randomly distributed in the range 100 ms to 400 ms. The input subnetworks had 4 units each and the output subnetworks had 1 unit each. Each network started with 10 different learning rules with integer coefficients randomly distributed in the range -2 to +2.</Paragraph>
      <Paragraph position="14"> Subnetwork connectivity matrices were generated with a probability of any element being non-zero of 0.3. If an element was non-zero, the learning rule used for the connections between the subnetworks was randomly selected from the 10 learning rules defined for the network. The networks were also constrained to be feed-forward, as shown in Figure 3.</Paragraph>
      <Paragraph position="15"> Figure 3: Networks were feed-forward, with no &amp;quot;hidden&amp;quot; units and a fixed number of input units (64) and output units (8). Input units were grouped into subnets of 4 units each and each input unit carried information from one of the 64 frequency values in the speech spectra, ranging from 0 to 8 kHz.</Paragraph>
    </Section>
  </Section>
</Paper>