<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1202">
  <Title>Natural Language Learning by Recurrent Neural Networks: A Comparison with probabilistic approaches</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1. Learning by Elman networks
</SectionTitle>
      <Paragraph position="0"> An Elman network having 9 hidden units and trained for 100,000 epochs was able to learn 72% of the total data.</Paragraph>
      <Paragraph position="1"> However using a &amp;quot;leave-one-sentence-out&amp;quot; 106-fold cross-validation technique, the best generalisation result following fast training was 63%. Figure 1 shows the fraction of the training data learned from I0 to 100,000 epochs of slow training. Early learning appears to proceed in discrete phases. In the first phase (up to 1000 epochs), the network predicts only NN, the category having highest frequency (30%). In phase 2 (1000 to 3000 epochs) the network predicts only NN or/S and scores 45% (the combined frequency of NN and/S is 47%). In phase 3 (3000 to 4000 epochs) the network predicts either NN,/S or VB, the three most common categories and at 5000 epochs it is predicting either NN, /S, VB orAIL The network's rms error with respect to the targets (labeled as &amp;quot;target error&amp;quot; in Figure 1) declined eontinuonsly during learning down to 0.160 at 80,000 epochs and increased slightly subsequently. It is also useful to measure the network's rms error with respect to n-gram probabilities on the assumption that the network should be learning n-gram probabilities with n increasing during training. These errors are referred to as bigram, trigram and 4-gram errors in Figure 1.</Paragraph>
      <Paragraph position="2"> Bigram error is initially less than trigram and 4-gram errors and declines most rapidly from 800 to 3000 epochs. It begins to increase again after 4000 epochs while trigram and 4-gram errors continue to decline.</Paragraph>
      <Paragraph position="3"> Al~er about 8,000 epochs, trigram error reaches a minimum value of 0.067 and then starts to increase. 4-gram error continue~ to decline to a value of 0.068 at 80,000 epochs after which it also starts to increase. 5-gram error (not shown in Figure 1 to preserve clarity) declines to a value of 0.076 at 100,000 iterations but is beginning to level out.</Paragraph>
      <Paragraph position="4"> To confirm that the Elman network is making predictions based on conditional probabilities and also to justify the calculation of output entropy as defined in the Methods section, we require that the sum of outputs should be close to 1.0. In Figure 2 it can be observed that from about 100 epochs, the average sum of outputs is indeed close to 1.0, although the standard deviation of the average sum increases from 0.02 at 100 epochs to 0.19 at 100,000 epochs. The entropy of the outputs (a measure of the network's 'uncertainty' about the next predicted category) declines as learning proceeds (Figure 2), but showing two 'fiat' periods corresponding to 'fiat' periods in target error.</Paragraph>
      <Paragraph position="5"> 3.2. Comparison of Eiman and RCC networks When trained on the set of 485 training patterns, the RCC network continued to add hidden units and was able to learn 99.6% of patterns after adding 42 hidden units (Figure 3). However a maximum generalisation of 63% on the test set was achieved after only 4 hidden units and generalisation declined with further addition of hidden units (Figure 3). By contrast, when Elman networks with 1-50 hidden units were trained on the same data, there was no simple recognisable relationship between generalisation and hidden layer size. An Ehnan network with 4 hidden units scored 60% on the test set, 3% lower than an RCC net of the same size. An Elman network with 9 hidden units scored 64%. However the best generalisation score of 68% was achieved with 42 hidden units.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3. Prediction Uncertainty
</SectionTitle>
      <Paragraph position="0"> Figure 4 shows a graph of prediction uncertainty (measured as the entropy of the output units) over a part of the sequence of category targets. Each point is labeled with the target category. Highest entropy always occurs when the input is the first VB in the sentence. An increase in entropy is also associated with the first category in the sentence. By contrast there is a low entropy associated with the prediction of sentence termination, 89% of sentence endings being correctly predicted.</Paragraph>
      <Paragraph position="1"> 3.4. Correctly predicted sequences It is possible to reconstruct the sequences correctly learned by the Elman network that had learned 72% of the training set. They are shown in Figure 5. The transitions marked with an asterix (&gt;*) are those not predicted by trigram probabilities. Sequences 1 and 5 include complete and grammatical sentence structures.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="6" type="metho">
    <SectionTitle>
4. Discussion
</SectionTitle>
    <Paragraph position="0"> 4.1. Training the Elman network After 10 iterations, the network was predicting the NN category for every pattern. Since NN was the highest frequency category, this was the quickest way for the network to reduce its initial prediction error. It could be said that the network was performing equivalently to a unigram predictor.</Paragraph>
    <Paragraph position="1"> Towsey, Diederich, Schellhammer, Chalup, Brugman 5 Natural Language Learning by Recurrent Neural Nets  progression of r.m.s, error with respect to the targets, bigram and trigram probabilities and the fraction of training set learned as a function of training epochs.</Paragraph>
    <Paragraph position="3"> &gt;* are not predicted by trigram probabilities.</Paragraph>
    <Paragraph position="4"> Although VB has the second highest frequency (higher than/S), during the second learning phase the network outputs were confined to NN or/S (not V'B).</Paragraph>
    <Paragraph position="5"> This is because the network was beginning to learn bigram probabilities and an inspection of the bigrarn frequency table revealed that there were 100 instnnees of/S prediction using bigram probabilities but only 35 instances of predicting a VB. It is during this second learning phase that the bigram error decreases most rapidly.</Paragraph>
    <Paragraph position="6"> In phase 3, the VB category is added to the network's prediction capability and in phase 4, AR is added, there being only 19 instances where AR would be predicted using bigram probabilities. In fact, using bigram probabilities only these four categories (NN,/S, VB and AR) can be predicted. At~er about 5000 epochs the network was also correctly predicting other categories, which indicates that it was making predictions based on the current and previous inputs. And indeed we observe /.hat the trigram error rate falls below the bigram error rate around 5000 epochs (Figure 1).</Paragraph>
    <Paragraph position="7"> In Figure 5 it is apparent that the network has correctly learned category transitions that are not predicted by trigram probabilities. This is indicative that the network was using at least the current and two previous inputs as context for its decisions. In fact 4-gram error continues to decline up to 80,000 epochs.</Paragraph>
    <Paragraph position="8"> Since the average sentence length is 5.05 words, it is not surprising that the 5-gram error remains above 4-grarn error throughout learning.</Paragraph>
    <Paragraph position="9"> Of course it is not being suggested here, that a recurrent network is first learning all the probabilities of a bigram model and then moves on to learn the trigram model and so on. Network learning is driven by the requirement to minimise predictive error. Thus longer sequences having high frequency will bias learning more than infrequently occurring short sequences.</Paragraph>
    <Paragraph position="10"> Nevertheless an interesting feature of learning apparent in Figure 1 was that minimum bigram error was achieved at 4000 epochs when the network had learned 48% of the training set, equivalent to the performance of a bigram predictor. Similarly minimum trigrarn and 4-gram error was achieved when the network had learned the equivalent of a lrigram and 4-gram predictor respectively.</Paragraph>
    <Paragraph position="11"> Mention should be made of the decision not to reset state unit activations to zero when the Elman network encountered d sentence boundary. When resets were Towsey, Diederich, Schellharamer, Chalup, Brugman 8 Natural Language Learning by Recurrent Neural Nets  used, network predictive performance dropped from 70% to 69% with otherwise similar training regimes. In other words, there was minimal information transfer over sentence boundaries and it is more interesting to observe this aspect of network learning than to impose 'forgetting' artifieiaUy. The slight increase in performance without resets was probably due to the repetitive nature of the sentences in this text meant for early readers. An additional reason for not using resets was that it made comparisons of network performance with n-gram statistics easier.</Paragraph>
    <Paragraph position="12"> 4.2. Comparison of Eiman network and RCC nets Although the RCC net was capable of learning almost the entire training set, the hidden unit representations that the network acquired did not generalise well. On the other hand, the best generalising RCC net with four hidden units did better than an Elman network with the same number of bidden units. Due to different learning algorithms, the two networks presumably acquired different hidden unit representations of the underlying task. It is clear that, for this task at least, training an RCC net to find the optimum number of hidden units for an Elman network is not a satisfactory technique.</Paragraph>
    <Paragraph position="13"> The maximum RCC score on the test set of 63% was, in fact, an unexpectedly high score. A bigram model acquired from the training set of 80 sentences, predicted 48% and 45% of the training and test set words respectively. The equivalent scores for the trigram model were 63% and 17% respectively. The poor generalisation of the n-gram models for n &gt; 2 arose because the test sequences did not have the same statistical slructure as the training sequences for n &gt; 2. This is the consequence of using natural language sentences and converting the words to lexical categories. A similar difficulty was noted by Lawrence et al (1996) for their NL task which required recurrent networks to classify sentences as either grammatical or ungrammatical.</Paragraph>
    <Paragraph position="14"> The experimental paradigm used in our experiment demands alternative measures of generalisation. These might include (1) testing on an artificially generated sequence that has the same n-gram (statistical) structure as the NL training sequence (2) testing on the training sequence corrupted with output noise (3) testing on the training sequence but with the sentences in random order. This last is appropriate where resets are not used during training. Such alternative tests of generalisation will be considered in future work.</Paragraph>
    <Section position="1" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.3. Prediction Uncertainty
</SectionTitle>
      <Paragraph position="0"> Elman (1990) found that when a recurrent net was trained on letter sequences consisting of concatenated words, its prediction error tended to decrease from beginning to end of each word. Thus a sharp increase in prediction error could be used to segment the letter sequence into words.</Paragraph>
      <Paragraph position="1"> In our study, there was low entropy associated with end-of-sentence prediction, 89% of/S being correctly predicted. Furthermore, when the input was/S, output entropy increased in 84% of cases. However by far the most obvious increase in prediction uncertainty occurred when the input was the first VB of the sentence (Figure 4).</Paragraph>
      <Paragraph position="2"> We should not expect that prediction uncertainty will decrease from beginning to end of a sentence in the same way that it does for words, because the rules which govern word structure are different from those which govern sentence structure. For example, the inventory of units that makes up words is so much smaller and the articulation of phoneme sequences is more highly constrained. It is not surprising therefore, to find that in our task, a sharp increase in the network's prediction uncertainty occurs other than when it encounters a sentence boundary.</Paragraph>
      <Paragraph position="3"> The first VB in our tagging system was either an auxiliary, or modal or the verb itself, if there was no auxiliary. In other words, the first VB has the largest number of highly probable successors. A linguistic interpretation of the network behaviour is complicated by the small number of lexical categories used in the study. Ira more fine-grained system of tagging had been used, the progression of prediction uncertainty through the sentences would have been different. All the sentences in the text consisted of single clauses and the network behaviour is consistent with the verb being the most important determinant of sentence or clause structure.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>