<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1028"> <Title>PARAMETER ESTIMATION FOR CONSTRAINED CONTEXT-FREE LANGUAGE MODELS</Title> <Section position="4" start_page="7962" end_page="7962" type="metho"> <SectionTitle> 2. STOCHASTIC CONTEXT-FREE GRAMMARS </SectionTitle> <Paragraph position="0"> A stochastic context-free grammar G is specified by the quintuple < VN, VT, R, S, P > where VN is a finite set of non-terminal symbols, VT is a finite set of terminal symbols, R is a set of rewrite rules, S is a start symbol in VN, and P is a parameter vector. If r 6 R, then Pr is the probability of using the rewrite rule r.</Paragraph> <Paragraph position="1"> For our experiments, we are using a 411 rule grammar which we will refer to as the Abney-2 grammar. The grammar has 158 syntactic variables, i.e., IVNI = 158.</Paragraph> <Paragraph position="2"> The rules of the Abney-2 grammar are of the form H -+ G1,G2 .... Gk where H, Gi 6 VN and k = 1,2 .....</Paragraph> <Paragraph position="3"> Hence, this grammar is not expressed in Chomsky Normal Form. We maintain this more general form for the purposes of linguistic analysis.</Paragraph> <Paragraph position="4"> An important measure is the probability of a derivation tree T. Using ideas from the random branching process literature \[2, 4\], we specify a derivation tree T by its depth L and the counting statistics zt(i,k),l = 1 .... ,n,i = 1 .... ,IVNI, and k = 1 ..... IRI. The counting statistic zz(i, k) is the number of non-terminals at 6 VN rewritten at level I with rule rk 6 R. With these statistics the probability of a tree T is given by</Paragraph> <Paragraph position="6"> In this model, the probability of a word string W1,N =</Paragraph> <Paragraph position="8"> where Parses(W1,N) is the set of parse trees for the given word string. For an unambiguous grammar, Parses(Wl,N) consists of a single parse.</Paragraph> </Section> <Section position="5" start_page="7962" end_page="7962" type="metho"> <SectionTitle> 3. PARAMETER ESTIMATION FOR SCFGS </SectionTitle> <Paragraph position="0"> An important problem in stochastic language models is the estimation of model parameters. In the parameter estimation problem for SCFGs, we observe a word string W1,N of terminal symbols. With this observation, we want to estimate the rule probabilities P. For a grammar in Chomsky Normal Form, the familiar Inside/Outside Algorithm is used to estimate P. However, the Abney-2 grammar is not in this normal form. Although the grammar could be easily converted to CNF, we prefer to retain its original form for linguistic relevance. Hence, we need an algorithm that can estimate the probabilities of rules in our more general form given above.</Paragraph> <Paragraph position="1"> The algorithm that we have derived is a specific case of Kupiec's trellis-based algorithm \[3\]. Kupiec's algorithm estimates parameters for general recursive transition networks. In our case, we only have rules of the following two types: 1. H ---~ G1G2&quot;..Gk where H, Gi E VN and k = 1,2 ....</Paragraph> <Paragraph position="2"> 2. H-+TwhereHEVN andTEVw.</Paragraph> <Paragraph position="3"> For this particular topology, we derived the following trellis-based algorithm.</Paragraph> <Paragraph position="4"> Trellis-based algorithm 1. Compute inner probabilities a(i,j,a) = Pr\[o&quot; Wij\] where a E VN and Wij denotes the substring</Paragraph> <Paragraph position="6"> For CNF grammars, the trellis-based algorithm reduces to the Inside-Outside algorithm. 
We have tested the algorithm on both CNF grammars and non-CNF grammars. In either case, the estimated probabilities are asymptotically unbiased.

4. SCFGS WITH BIGRAM CONSTRAINTS

We now consider adding bigram relative frequencies as constraints on our stochastic context-free trees. The situation is shown in Figure 1. In this figure, a word string is shown with its bigram relationships and its underlying parse tree structure.

In this model, we assume a given prior context-free distribution $\beta(W_{1,N})$ (Equation 2). This prior distribution may be obtained via the trellis-based estimation algorithm (Section 3) applied to a training text or, alternatively, from a hand-parsed training text. We are also given bigram relative frequencies $h_{\sigma_i,\sigma_j}(W_{1,N})$, where $\sigma_i, \sigma_j \in V_T$.

Given this type of structure, involving both hierarchical and bigram relationships, what probability distribution on word strings should we consider? The following theorem states the maximum entropy solution.

Theorem 1. The distribution maximizing the generalized entropy subject to the bigram constraints has the form

$$\Pr(W_{1,N}) = \frac{1}{Z}\, \beta(W_{1,N}) \exp\Big( \sum_{\sigma_1, \sigma_2 \in V_T} \alpha_{\sigma_1,\sigma_2}\, h_{\sigma_1,\sigma_2}(W_{1,N}) \Big),$$

where $Z$ is the normalizing constant.

Remarks. The specification of bigram constraints for $h(\cdot)$ is not necessary for the derivation of this theorem. The constraint function $h(\cdot)$ may be any function of the word string, including general N-grams. Also, note that if the parameters $\alpha_{\sigma_1,\sigma_2}$ are all zero, then this distribution reduces to the unconstrained stochastic context-free model.

5. SIMULATION

For simulation purposes, we would like to be able to draw sample word strings from the maximum entropy distribution. The generation of such sentences for this language model cannot be done directly as in the unconstrained context-free model. In order to generate sentences, a random sampling algorithm is needed. A simple Metropolis-type algorithm is presented to sample from our distribution.

The distribution must first be expressed in Gibbs form:

$$\Pr(W_{1,N}) = \frac{1}{Z}\, e^{-E(W_{1,N})}, \qquad E(W_{1,N}) = -\log \beta(W_{1,N}) - \sum_{\sigma_1, \sigma_2 \in V_T} \alpha_{\sigma_1,\sigma_2}\, h_{\sigma_1,\sigma_2}(W_{1,N}).$$

Given this 'energy' $E$, the following algorithm generates a sequence of samples $\{W^1, W^2, W^3, \ldots\}$ from this distribution (a sketch of this loop is given after the proposition below).

Random sampling algorithm
1. Perturb $W^i$ to $W^{\mathrm{new}}$.
2. Compute $\Delta E = E(W^{\mathrm{new}}) - E(W^i)$.
3. If $\Delta E < 0$, then $W^{i+1} \leftarrow W^{\mathrm{new}}$; otherwise $W^{i+1} \leftarrow W^{\mathrm{new}}$ with probability $e^{-\Delta E}$ and $W^{i+1} \leftarrow W^i$ otherwise.
4. Increment $i$ and repeat from step 1.

In the first step, the perturbation of a word string is done as follows:
1. Generate parses of the string $W$.
2. Choose one of these parses.
3. Choose a node in the parse tree.
4. Generate a subtree rooted at this node according to the prior rule probabilities.
5. Let the terminal sequence of the modified tree be the new word string $W^{\mathrm{new}}$.

This method of perturbation satisfies the detailed balance conditions in random sampling.

Proposition. Let $\{W^1, W^2, W^3, \ldots\}$ be a sequence of samples generated with the random sampling algorithm above. The sequence converges weakly to the distribution $\Pr(W_{1,N})$.
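A minimal sketch of the accept/reject loop above, assuming the caller supplies an `energy` function (e.g., built from the Gibbs energy $E$) and a `perturb` function implementing the subtree-regeneration step; the names and signatures are illustrative, not the paper's implementation.

```python
import math
import random

def metropolis_samples(w_init, energy, perturb, num_samples):
    """Metropolis-type sampler in the spirit of Section 5 (a sketch).

    energy(w)  -> E(w), e.g. -log beta(w) minus the weighted bigram constraints
    perturb(w) -> a proposal w_new obtained by re-deriving a random subtree
                  of one of w's parses under the prior rule probabilities
    """
    samples = [w_init]
    w = w_init
    for _ in range(num_samples):
        w_new = perturb(w)
        delta_e = energy(w_new) - energy(w)
        # Accept if the energy decreases, otherwise accept with prob e^{-delta_e}.
        if delta_e <= 0 or random.random() < math.exp(-delta_e):
            w = w_new
        # On rejection, the current word string is retained.
        samples.append(w)
    return samples
```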
6. PARAMETER ESTIMATION FOR THE CONSTRAINED CONTEXT-FREE MODEL

In the parameter estimation problem for the constrained context-free model, we are given an observed word string $W_{1,N}$ of terminal symbols and want to estimate the $\alpha$ parameters in the maximum entropy distribution $\Pr(W_{1,N})$. One criterion for estimating these parameters is maximizing the likelihood given the observed data. Maximum likelihood estimation yields the following condition for the optimum (ML) estimates:

$$\frac{\partial}{\partial \alpha_{\sigma_a,\sigma_b}} \log \Pr(W_{1,N}) = 0 \quad \text{for all } \sigma_a, \sigma_b \in V_T.$$

Evaluating the left-hand side gives the following maximum likelihood condition:

$$E_{\alpha}\big[ h_{\sigma_a,\sigma_b}(W_{1,N}) \big] = h_{\sigma_a,\sigma_b}(W_{1,N}),$$

i.e., the expected value of each constraint under the model must equal its observed value. One method to obtain the maximum likelihood estimates is given by Younes [5]. His estimation algorithm uses a random sampling algorithm to estimate the expected value of the constraints in a gradient descent framework. Another method is the pseudolikelihood approach, which we consider here.

In the pseudolikelihood approach, an approximation to the likelihood is derived from local probabilities [1]; in our problem, these local probabilities are conditional probabilities defined on the observed word string, and the pseudolikelihood $PS$ is their product. Maximizing the pseudolikelihood $PS$ is equivalent to maximizing the log-pseudolikelihood, $\log PS$.

We estimate the $\alpha$ parameters by maximizing the log-pseudolikelihood with respect to the $\alpha$'s. The algorithm that we use to do this is a gradient descent algorithm, an iterative algorithm in which the parameters are updated by a factor of the gradient, i.e.,

$$\alpha_{\sigma_1,\sigma_2}^{(i+1)} = \alpha_{\sigma_1,\sigma_2}^{(i)} + \mu\, \frac{\partial \log PS}{\partial \alpha_{\sigma_1,\sigma_2}},$$

where $\mu$ is the step size and the gradient is taken with respect to the $\alpha$ parameters. The gradient descent algorithm is sensitive to the choice of step size $\mu$. This choice is typically made by trial and error.
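A minimal sketch of this iterative update, assuming a caller-supplied `grad_log_pseudolikelihood` function that returns the partial derivatives of the log-pseudolikelihood with respect to each $\alpha_{\sigma_1,\sigma_2}$; the function and parameter names are illustrative assumptions, not the paper's implementation.

```python
def estimate_alphas(grad_log_pseudolikelihood, alpha_init,
                    step_size=0.01, num_iterations=100):
    """Gradient-style estimation of the alpha parameters (illustrative sketch).

    grad_log_pseudolikelihood(alpha) is assumed to return a dict mapping each
    terminal bigram (sigma_1, sigma_2) to the partial derivative of the
    log-pseudolikelihood with respect to alpha[(sigma_1, sigma_2)].
    """
    alpha = dict(alpha_init)
    for _ in range(num_iterations):
        grad = grad_log_pseudolikelihood(alpha)
        for bigram, g in grad.items():
            # Step each parameter along the gradient of the log-pseudolikelihood,
            # scaled by the step size mu (here `step_size`).
            alpha[bigram] = alpha.get(bigram, 0.0) + step_size * g
    return alpha
```

As the section notes, the behavior of this loop depends strongly on the step size, which in practice is chosen by trial and error.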