<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1209"> <Title>Knowledge Extraction and Recurrent Neural Networks: An Analysis of an Elman Network trained on a Natural Language Learning Task</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. Discussion </SectionTitle> <Paragraph position="0"> Although an Elman network with two hidden units could learn only 51% of the training data, graphical analysis nevertheless reveals hierarchical clustering of hidden unit activations. There are clusters associated with each of the ten word categories (Figure 1), although clusters associated with low-frequency inputs such as AR, CC and JJ tend to overlap. Clusters labeled with the high-frequency inputs revealed obvious sub-clusters and sub-sub-clusters, such as those shown in Figure 2. In other words, the network had acquired internal representations of temporal sequences of at least length 3. However, because the hidden unit space had such low dimensionality, it could not be partitioned by the output layer to achieve accurate prediction.</Paragraph> <Paragraph position="1"> An FSA with 18 states derived from k-means clustering of hidden unit activations scored 60% on the original training data (Table 2). This compares with a score of 69% by the original network and a score of 62% when predicting with a trigram model. Although in theory the trigram model requires the calculation of 10,000 transition probabilities, it reduces to 42 transition rules. This compares with the 62 transition rules incorporated into the 18-state FSA. Thus the trigram model is a more compact definition of the grammar. 
However, low-frequency transitions can be trimmed from the FSAs with minimal loss of performance, as is demonstrated for the 10-state FSA in Table 3.</Paragraph> <Paragraph position="2"> Correct choice of the 'rescue' state is important for the efficient performance of an FSA because it determines the FSA's ability to pick up the sentence structure after a missing transition. In order to automate the production of FSAs following cluster analysis, we require a heuristic for the choice of 'rescue' state. Our initial choice, the state whose inputs on average are closest to the beginning of the sentence, seems to be a reasonable heuristic in the absence of other information (Table 4). Likewise, choosing the highest-frequency category (in our case, NN) as the 'rescue' output is also confirmed by our results, because the FSA scores achieved on non-missing transitions are not much better than the total scores, despite 10-20% of missing transitions (Table 4).</Paragraph> <Paragraph position="3"> The use of domain knowledge, such as category frequencies, to bias weight initialisation succeeds in reducing error faster in the early stages of learning. Of course, if training is continued for long enough, any memory of the initial bias will be erased. Best results were achieved when five links were set (such that no hidden unit had a set link to both the NN and /S inputs) or eight links were set (such that only five hidden units had set links to both the NN and /S inputs).</Paragraph> </Section> </Paper>
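The trigram baseline discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tag names, the `<s>` padding symbol, and the training sentences are invented, and the `fallback` default stands in for the paper's highest-frequency category NN. It shows why a model that nominally requires thousands of transition probabilities reduces to a few dozen rules: only observed two-tag contexts contribute a prediction rule (the most frequent continuation of that context).

```python
# Hypothetical sketch of the trigram baseline; tags and data are illustrative.
from collections import Counter, defaultdict


def train_trigram(sentences):
    """Count (t1, t2) -> next-tag frequencies; keep one argmax rule per context."""
    counts = defaultdict(Counter)
    for tags in sentences:
        padded = ["<s>", "<s>"] + tags  # pad so sentence-initial tags get a context
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            counts[(a, b)][c] += 1
    # Each observed context yields exactly one rule: its most frequent successor.
    return {ctx: ctr.most_common(1)[0][0] for ctx, ctr in counts.items()}


def predict(rules, tags, fallback="NN"):
    """Predict the next tag from the last two tags; unseen contexts fall back
    to the highest-frequency category (NN in the paper's corpus)."""
    padded = ["<s>", "<s>"] + tags
    return rules.get((padded[-2], padded[-1]), fallback)
```

For example, after training on a handful of tag sequences, `predict(rules, ["AT", "NN"])` returns whichever tag most often followed the context `(AT, NN)` in training, and an unseen context returns the fallback.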
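The 'rescue' mechanism described above can be sketched as a small interpreter for an extracted FSA. This is a hypothetical sketch, not the paper's 18-state machine: the states, transition table, and tags below are invented for illustration. On a missing transition, the machine emits the rescue output (NN, the highest-frequency category) and jumps to the designated rescue state, so it can pick up the sentence structure again.

```python
# Hypothetical sketch of running an extracted FSA with a 'rescue' state.
def run_fsa(transitions, outputs, inputs, start=0, rescue=0, rescue_out="NN"):
    """Predict one output tag per input tag.

    transitions: {(state, tag): next_state}
    outputs:     {state: predicted_tag}  -- prediction made on entering a state
    """
    state, preds = start, []
    for tag in inputs:
        if (state, tag) in transitions:
            state = transitions[(state, tag)]
            preds.append(outputs[state])
        else:
            # Missing transition: emit the rescue output and restart from the
            # rescue state rather than halting on the unseen input.
            state = rescue
            preds.append(rescue_out)
    return preds
```

Trimming low-frequency transitions, as in Table 3, simply removes entries from `transitions`; the rescue branch then absorbs the extra missing transitions at modest cost to accuracy.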