<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1021">
  <Title>Training Connectionist Models for the Structured Language Model</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A Probabilistic Neural Network Model
</SectionTitle>
    <Paragraph position="0"> Recently, a relatively new type of language model has been introduced where words are represented by points in a multi-dimensional feature space and the probability of a sequence of words is computed by means of a neural network. The neural network, having the feature vectors of the preceding words as its input, estimates the probability of the next word (Bengio et al., 2001). The main idea behind this model is to fight the curse of dimensionality by interpolating the seen sequences in the training data. The generalization this model aims at is to assign to an unseen word sequence a probability similar to that of a seen word sequence whose words are similar to those of the unseen word sequence. The similarity is defined as being close in the multi-dimensional space mentioned above.</Paragraph>
    <Paragraph position="1"> In brief, this model can be described as follows.</Paragraph>
    <Paragraph position="2"> A feature vector is associated with each token in the input vocabulary, that is, the vocabulary of all the items that can be used for conditioning. Then the conditional probability of the next word is expressed as a function of the input feature vectors by means of a neural network. This probability is produced for every possible next word from the output vocabulary. In general, there does not need to be any relationship between the input and output vocabularies.</Paragraph>
    <Paragraph position="3"> The feature vectors and the parameters of the neural network are learned simultaneously during training.</Paragraph>
    <Paragraph position="4"> The input to the neural network are the feature vectors for all the inputs concatenated, and the output is the conditional probability distribution over the output vocabulary. The idea here is that the words which are close to each other (close in the sense of their role in predicting words to follow) would have similar (close) feature vectors and since the probability function is a smooth function of these feature values, a small change in the features should only lead to a small change in the probability.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Architecture of the Neural Network
Model
</SectionTitle>
      <Paragraph position="0"> The conditional probability function a0a2a1a4a3a6a5a7a6a8a10a9a11a7a13a12a14a9a16a15a16a15a16a15a10a9a11a7a18a17a20a19a21a8a23a22 where a7a25a24 and a3 are from the input and output vocabularies a26 a24 and a26a18a27 respectively, is determined in two parts:  1. A mapping that associates with each word in the input vocabulary a26 a24 a real vector of fixed length a28 2. A conditional probability function which takes  as the input the concatenation of the feature vectors of the input items a7a29a8a16a9a11a7a25a12a14a9a16a15a16a15a16a15a30a9a11a7a18a17a20a19a21a8 . The function produces a probability distribution (a vector) over a26 a27 , the a31a20a32a34a33a36a35 element being the conditional probability of the a31a10a32a37a33a36a35 member of a26 a27 . This probability function is realized by a standard multi-layer neural network. A softmax function (Equation 4) is used at the output of the neural net to make sure probabilities sum to 1.</Paragraph>
      <Paragraph position="1"> Training is achieved by searching for parameters a38 of the neural network and the values of feature vectors that maximize the penalized log-likelihood of the training corpus:</Paragraph>
      <Paragraph position="3"> size and a75 a1a76a38a77a22 is a regularization term, sum of the parameters' squares in our case.</Paragraph>
      <Paragraph position="4"> The model architecture is given in Figure 1. The neural network is a simple fully connected network with one hidden layer and sigmoid transfer functions. The input to the function is the concatenation of the feature vectors of the input items. The output of the output layer is passed though a softmax to</Paragraph>
      <Paragraph position="6"> make sure that the scores are positive and sum up to one, hence are valid probabilities. More specifically, the output of the hidden layer is given by:</Paragraph>
      <Paragraph position="8"> a28are weight and bias elements for the hidden layer respectively, and a37 is the number of hidden units.</Paragraph>
      <Paragraph position="9"> Furthermore, the outputs are given by:</Paragraph>
      <Paragraph position="11"> are weight and bias elements for the output layer before the softmax layer. The softmax layer (equation 4) ensures that the outputs are positive and sum to one, hence are valid probabilities. The a30 a32 a33a36a35 output of the neural network, corresponding to the a30 a32 a33a36a35 item a3  of the output vocabulary, is exactly the sought conditional probability, that is a45</Paragraph>
      <Paragraph position="13"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Training the Neural Network Model
</SectionTitle>
      <Paragraph position="0"> Standard back-propagation is used to train the parameters of the neural network as well as the feature vectors. See (Haykin, 1999) for details about neural networks and back-propagation. The function we try to maximize is the log-likelihood of the training data given by equation 1. It is straightforward to compute the gradient of the likelihood function for the feature vectors and the neural network parameters, and hence compute their updates.</Paragraph>
      <Paragraph position="1"> We should note from equation 4 that the neural network model is similar in functional form to the maximum entropy model (Berger et al., 1996) except that the neural network learns the feature functions by itself from the training data. However, unlike the G/IIS algorithm for the maximum entropy model, the training algorithm (usually stochastic gradient descent) for the neural network models is not guaranteed to find even a local maximum of the objective function.</Paragraph>
      <Paragraph position="2"> It is very important to mention that one of the great advantages of this model is that the number of inputs can be increased causing only sub-linear increase in the number of model parameters, as opposed to exponential growth in n-gram models. This makes the parameter estimation more robust, especially when the input span is long.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Structured Language Model
</SectionTitle>
    <Paragraph position="0"> An extensive presentation of the SLM can be found in (Chelba and Jelinek, 2000). The model assigns a probability a0a2a1 a35 a9a8a48a37a22 to every sentence a35 and every possible binary parse a48 . The terminals of a48 are the words of a35 with POS tags, and the nodes of a48 are annotated with phrase headwords and non-terminal labels. Let a35 be a sentence of length a49</Paragraph>
    <Paragraph position="2"> be the word a30 -prefix of the sentence -- the words from the beginning of the sentence up to the current position a30 -- and a35</Paragraph>
    <Paragraph position="4"> the word-parse a30 -prefix. Figure 2 shows a word-parse a30 -prefix; h_0, .., h_{-m} are the exposed heads, each head being a pair (headword, non-terminal label), or (word, POS tag) in the case of a root-only tree. The exposed heads at a given position a30 in the input sentence are a function of the word-parse a30 -prefix.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Probabilistic Model
</SectionTitle>
      <Paragraph position="0"> The joint probability a0a2a1 a35  carried out at position k in the word string; the operations performed by the CONSTRUCTOR ensure that all possible binary branching parses, with all possible headword and non-terminal label assignments for the a50 a8a6a72a16a72a16a72 a50  word sequence, can be generated. The a45 a28a8 a72a16a72a16a72 a45 a28a4 a1 sequence of CONSTRUC-</Paragraph>
      <Paragraph position="2"> The SLM is based on three probabilities, each can be specified using various smoothing methods and parameterized (approximated) by using different contexts. The bottom-up nature of the SLM parser enables us to condition the three probabilities on features related to the identity of any exposed head and any structure below the exposed head.</Paragraph>
      <Paragraph position="3"> Since the number of parses for a given word prefix</Paragraph>
      <Paragraph position="5"> the state space of our model is huge even for relatively short sentences, so we have to use a search strategy that prunes it. One choice is a synchronous multi-stack search algorithm (Chelba and Jelinek, 2000) which is very similar to a beam search.</Paragraph>
      <Paragraph position="6"> The language model probability assignment for the word at position a30a16a15 a1 in the input sentence is  which ensures a proper probability normalization over strings a35a30a29 , where a31  is the set of all parses present in our stacks at the current stage a30 .</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 N-best EM Training of the SLM
</SectionTitle>
      <Paragraph position="0"> Each model component of the SLM --WORD-PREDICTOR, TAGGER, CONSTRUCTOR-- is initialized from a set of parsed sentences after undergoing headword percolation and binarization. An N-best EM (Chelba and Jelinek, 2000) variant is then employed to jointly reestimate the model parameters such that the PPL on training data is decreased --the likelihood of the training data under our model is increased. The reduction in PPL is shown experimentally to carry over to the test data.</Paragraph>
      <Paragraph position="1"> Let a1 a35 a9a8a48 a22 denote the joint sequence of a35 with parse structure a48 . The probability of a a1 a35 a9a8a48a37a22 sequence a0a2a1 a35 a9a8a48 a22 is, according to Equation 5, the product of the corresponding elementary events.</Paragraph>
      <Paragraph position="2"> This product form makes the three components of the SLM separable, therefore, we can estimate the parameters separately. According to the EM algorithm, the auxiliary function can be written as:</Paragraph>
      <Paragraph position="4"> previous iteration, the M step is to find parameters a42 that maximize the auxiliary function a43 a1 a42 a9 a40a42 a22 above. In practice, since the space of a48 , all possible parses, is huge, we normally use a synchronous multi-stack search algorithm to sample the most probable  parses and approximate the space by the N-best parses. (Chelba and Jelinek, 2000) showed that as long as the N-best parses remain invariant, the M step will increase the likelihood of the training data.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Neural Network Models in the SLM
</SectionTitle>
    <Paragraph position="0"> As described in the previous section, the three components of the SLM can be parameterized in various ways. The neural network model, because of its ability in fighting the data sparseness problem, is a very natural choice when we want to use longer contexts to improve the language model performance.</Paragraph>
    <Paragraph position="1"> The training criterion for the neural network model is given by Equation 1 , when we have labeled training data for the SLM. The labels --the parse structure-- are used to get the conditioning variables. In order to take advantage of the ability of the SLM in generating many hidden parses, we need to modify the training criterion for the neural network model. Actually, if we take the EM auxiliary function in Equation 7 and find parameters of the neural network models to maximize a43 a1 a42 a9 a40a42 a22 , the solution will be very simple. When standard back-propagation is used to optimize Equation 1, the derivative of a0 with respect to the parameters is calculated and used as the direction for the gradient descent algorithm. Since a43 a1 a42 a9 a40a42 a22 is nothing but a weighted average of the log-likelihood functions, the derivative of a43 with respect to the parameters is then a weighted average of the derivatives of the log-likelihood functions. In practice, we use the SLM with all components modeled by neural networks to generate N-best parses in the E step, and for the M step, we use the modified back-propagation algorithm to estimate the parameters of the neural network models based on the weights calculated in the E step.</Paragraph>
    <Paragraph position="2"> We should be aware that there is no proof that this EM procedure can actually increase the likelihood of the training data. Not only are we using a small portion of the entire hidden parse space, but we also use the stochastic gradient descent algorithm that is not guaranteed to converge, for training the neural network models. Bearing this in mind, we will show experimentally that this flawed EM procedure can still lead to improvements in PPL.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We have used the UPenn Treebank portion of the WSJ corpus to carry out our experiments. The UPenn Treebank contains 24 sections of hand-parsed sentences. We used section 00-20 for training our models, section 21-22 for tuning some parameters (i.e., estimating discount constant for smoothing, and/or making sure overtraining does not occur) and section 23-24 to test our models. Before carrying out our experiments, we normalized the text in the following ways: numbers in Arabic form are replaced by a single token &amp;quot;N&amp;quot;, punctuations are removed, all words are mapped to lower case, extra information in the parse (such like traces) are ignored.</Paragraph>
    <Paragraph position="1"> The word vocabulary contains 10k words including a special token for unknown words. There are 40 items in the part-of-speech set and 54 items in the non-terminal set, respectively. All of the experimental results in this section are based on this corpus and split, unless otherwise stated.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Getting a Better Baseline
</SectionTitle>
      <Paragraph position="0"> Since better performance of the SLM was reported recently in (Kim et al., 2001) by using Kneser-Ney smoothing, we first improved the baseline model by using a variant of Kneser-Ney smoothing: the interpolated Kneser-Ney smoothing as in (Goodman, 2001), which is also implemented in the SRILM toolkit (Stolcke, 2002).</Paragraph>
      <Paragraph position="1"> There are three notable differences in our implementation of the interpolated Kneser-Ney smoothing related to that in the SRILM toolkit. First, we used one discount constant for each n-gram level, instead of three different discount constants. Second, our discount constant was estimated by maximizing the log-likelihood of the heldout data (assuming the discount constant is between 0 and 1), instead of the Good-Turing estimate. Finally, in order to deal with the fractional counts we encounter during the EM training procedure, we developed an approximate Kneser-Ney smoothing for fractional counts. For lack of space, we do not go into the details of this approximation, but our approximation becomes the exact Kneser-Ney smoothing when the counts are integers. null In order to test our Kneser-Ney smoothing implementation, we built a trigram language model and compared the performance with that from the SRILM. Our PPL was 149.6 and the SRILM PPL was 148.3, therefore, although there are differences in the implementation details, we think our result is close enough to the SRILM.</Paragraph>
      <Paragraph position="2"> Having tested the smoothing method, we applied it to the SLM. We used the Kneser-Ney smoothing to all components with the same parameterization as the h-2 scheme in (Xu et al., 2002). Table 1 is the comparison between the deleted-interpolation (DI) smoothing and the Kneser-Ney (KN) smoothing. The a1 in Table 1 is the interpolation weight between the SLM and the trigram language model (a1 =1.0 being the trigram language model). The notation &amp;quot;En&amp;quot; indicates the models were obtained after &amp;quot;n&amp;quot; iterations of EM training1. Since Kneser-Ney smoothing is consistently better than deletedinterpolation, we later on report only the Kneser-Ney smoothing results when comparing to the neural network models.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Training Neural Network Models with the
Treebank
</SectionTitle>
      <Paragraph position="0"> We used the neural network models for all of the three components of the SLM. The neural network models are exactly as described in Section 2.1. Since the inputs to the networks are always a mixture of words and NT/POS tags, while the output probabilities are over words in the PREDICTOR, POS tags in the TAGGER, and adjoint actions in the PARSER, we used separate input and output vocabularies in all cases. In all of our experiments with the neural network models, we used 30 dimensional feature vectors as input encoding of the mixed items, 100 hidden units and a starting learning rate of 0.001.</Paragraph>
      <Paragraph position="1"> Stochastic gradient descent was used for training the models for a maximum of 50 iterations. The initialization for the parameters is done randomly with a uniform distribution centered at zero.</Paragraph>
      <Paragraph position="2"> In order to study the behavior of the SLM when longer context is used for conditioning the probabilities, we gradually increased the context of the PREDICTOR model. First, the third exposed previous head was added. Since the syntactical head gets the head word from one of the children, either left or right, the child that does not contain the head word (hence called opposite child) is never used later on in predicting. This is particularly not appropriate for the prepositional phrase because the preposition is always the head word of the phrase in the UPenn Treebank annotation. Therefore, we also added the opposite child of the first exposed previous head into the context for predicting. Both Kneser-Ney smoothing and the neural network model were studied when the context was gradually increased. The results are shown in Table 2.</Paragraph>
      <Paragraph position="3"> In Table 2, &amp;quot;nH&amp;quot; stands for &amp;quot;n&amp;quot; exposed previous heads are used for conditioning in the PREDICTOR component, &amp;quot;nOP&amp;quot; stands for &amp;quot;n&amp;quot; opposite children are used, starting from the most recent one. As we can see, when the length of the context is increased,  Kneser-Ney smoothing saturates quickly and could not improve the PPL further. On the other hand, the neural network model can still consistently improve the PPL, as longer context is used for predicting. Overall, the best neural network model (after interpolation with a trigram) achieved 8% relative improvement over the best result from Kneser-Ney smoothing.</Paragraph>
      <Paragraph position="4"> Another interesting result is that it seems the neural network model can learn a probability distribution that is less correlated to the normal trigram model. Although before interpolating with the trigram, the PPL results of the neural network models are not as good as the Kneser-Ney smoothed models, they become much better when combined with the trigram. In the results of Table 2, the trigram model is a Kneser-Ney smoothed model that gave PPL of 149.6 by itself. The interpolation weight with the tri-gram is 0.4 and 0.5 respectively, for the Kneser-Ney smoothed SLM and neural network based SLM.</Paragraph>
      <Paragraph position="5">  To better understand why using the neural network models can result in such behavior, we should look at the difference between the training PPL and test PPL. Figure 3 shows the ratio between the test PPL and train PPL. We can see that for the neural network models, the ratios are much smaller than that for the Kneser-Ney smoothed models. Furthermore, as the length of context increases, the ratio for the Kneser-Ney smoothed model becomes greater -- a clear sign of over-parameterization. However, the ratio for the neural network model changes very little even when the length of the context increases from 4 (2H) to 8 (3H-1OP). The exact reason why the neural network models are more uncorrelated to the trigram is not completely understood, but we conjecture that part of the reason is that the neural network models can learn a probability distribution very different from the trigram by putting much less probability mass on the training examples.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5.3 Training the Neural Network Models with
EM
</SectionTitle>
    <Paragraph position="0"> After the neural network models were trained from the labeled data --the UPenn Treebank-- we performed one iteration of the EM procedure described in Section 4. The neural network model based SLM was used to get N-best parses for each training sentence, via the multi-stack search algorithm. This E step provided us a bigger collection of parse structures with weights associated with them. In the next M step, we used the stochastic gradient descent algorithm (modified to utilize the weights associated with each parse structure) to train the neural network models. The modified stochastic gradient descent algorithm was run for a maximum of 30 iterations and the initial parameter values are those from the the previous iteration.</Paragraph>
    <Paragraph position="1">  Table 3 shows the PPL results after one EM training iteration for both the neural network models and the approximated Kneser-Ney smoothed models, compared to the results before EM training.</Paragraph>
    <Paragraph position="2"> For the neural network models, the EM training did improve the PPL further, although not a lot. The improvement from training is consistent with the training results showed in (Xu et al., 2002) where deleted-interpolation smoothing was used for the SLM components. It is worth noting that the approximated Kneser-Ney smoothed models could not improve the PPL after one iteration of EM training.</Paragraph>
    <Paragraph position="3"> One possible reason is that in order to apply Kneser-Ney smoothing to fractional counts, we had to approximate the discounting. The approximation may degrade the benefit we could have gotten from the EM training. Similarly, the M step in the EM procedure for the neural network models also has the same problem: the stochastic gradient descent algorithm is not guaranteed to converge. This can be clearly seen in Figure 4 in which we plot the learning curves of the 3H-1OP model (PREDICTOR component) on both training and heldout data at EM iteration 0 and iteration 1. For EM iteration 0, because we started from parameters drawn from a uniform distribution, we only plot the last 30 iterations of the stochastic gradient descent.</Paragraph>
    <Paragraph position="4">  As we expected, the learning curve of the training data in EM iteration 1 is not as smooth as that in EM iteration 0, and even more so for the heldout data. However, the general trend is still decreasing. Although we can not prove that the EM training of the neural network models via the SLM can improve the PPL, we observed experimentally a gain that is favorable comparing to that from the usual Kneser-Ney smoothed models or deleted interpolation models. null</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML