<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1011">
  <Title>PRECISE N-GRAM PROBABILITIES FROM STOCHASTIC CONTEXT-FREE GRAMMARS</Title>
  <Section position="5" start_page="74" end_page="74" type="metho">
    <SectionTitle>
MOTIVATION
</SectionTitle>
    <Paragraph position="0"> There are good arguments that SCFGs are in principle not adequate probabilistic models for natural languages, due to the conditional independence assumptions they embody (Magerman and Marcus, 1991; Jones and Eisner, 1992; Briscoe and Carroll, 1993). Such shortcomings can be partly remedied by using SCFGs with very specific, semantically oriented categories and rules (Jurafsky et al., 1994). If the goal is to use n-grams nevertheless, then their their computation from a more constrained SCFG is still useful since the results can be interpolated with raw n-gram estimates for smoothing.</Paragraph>
    <Paragraph position="1"> An experiment illustrating this approach is reported later in the paper.</Paragraph>
    <Paragraph position="2"> On the other hand, even if vastly more sophisticated language models give better results, r~-grams will most likely still be important in applications such as speech recognition. The standard speech decoding technique of frame-synchronous dynamic programming (Ney, 1984) is based on a first-order Markov assumption, which is satisfied by bi-grams models (as well as by Hidden Markov Models), but not by more complex models incorporating non-local or higher-order constraints (including SCFGs). A standard approach is therefore to use simple language models to generate a preliminary set of candidate hypotheses. These hypotheses, e.g., represented as word lattices or N-best lists (Schwartz and Chow, 1990), are re-evaluated later using additional criteria that can afford to be more costly due to the more constrained outcomes. In this type of setting, the techniques developed in this paper can be used to compile probabilistic knowledge present in the more elaborate language models into n-gram estimates that improve the quality of the hypotheses generated by the decoder.</Paragraph>
    <Paragraph position="3"> Finally, comparing directly estimated, reliable n-grams with those compiled from other language models is a potentially useful method for evaluating the models in question. For the purpose of this paper, then, we assume that computing n-grams from SCFGs is of either practical or theoretical interest and concentrate on the computational aspects of the problem.</Paragraph>
    <Paragraph position="4"> It should be noted that there are alternative, unrelated methods for addressing the problem of the large parameter space in n-gram models. For example, Brown et al. (1992) describe an approach based on grouping words into classes, thereby reducing the number of conditional probabilities in the model.</Paragraph>
  </Section>
  <Section position="6" start_page="74" end_page="76" type="metho">
    <SectionTitle>
THE ALGORITHM
</SectionTitle>
    <Paragraph position="0"> Normal form for SCFGs A grammar is in Chomsky Normal Form (CNF) if every production is of the form A ~ B C or A ~ terminal. Any CFG or SCFG can be converted into one in CNF which generates exactly the same language, each of the sentences with exactly the same probability, and for which any parse in the original grammar would be reconstructible from a parse in the CNF grammar. In short, we can, without loss of generality, assume that the SCFGs we are dealing with are in CNF. In fact, our algorithm generalizes straightforwardly to the more general Canonical Two-Form (Graham et al., 1980) format, and in the case of bigrams (n =- 2) it can even be modified to work directly for arbitrary SCFGs. Still, the CNF form is convenient, and to keep the exposition simple we assume all SCFGs to be in CNF.</Paragraph>
    <Paragraph position="1"> Probabilities from expectations The first key insight towards a solution is that the n-gram probabilities can be obtained from the associated expected frequencies for n-grams and (n - 1)-grams:</Paragraph>
    <Paragraph position="3"> where c(wlL ) stands for the expected count of occurrences of the substring w in a sentence of L.1 Proof Write the expectation for n-grams recursively in terms of those of order n - 1 and the conditional n-gram probabilities: C(Wl...Wr~\[L) ~--- C(Wl...W~_llL)P(w~lw lw2...wr~_l). So if we can compute c(wlG) for all substrings w of lengths n and n - 1 for a SCFG G, we immediately have an n-gram grammar for the language generated by G.</Paragraph>
    <Paragraph position="4"> Computing expectations Our goal now is to compute the substring expectations for a given grammar. Formalisms such as SCFGs that have a recursive rule structure suggest a divide-and-conquer algorithrn that follows the recursive structure of the grammar, z We generalize the problem by considering c(wIX), the expected number of (possibly overlapping) occurrences of</Paragraph>
    <Paragraph position="6"> sought, where S is the start symbol for the grammar.</Paragraph>
    <Paragraph position="7"> Now consider all possible ways that nonterminal X can generate string w = wl ... wn as a substring, denoted by X ::~ ... wl * .. wn .... and the associated probabilities. For each production of X we have to distinguish two main cases, assuming the grammar is in CNF. If the string in question is of length I, w = wl, and if X happens to have a production X --~ Wl, then that production adds exactly P(X --~ wt) to the expectation c(w IX).</Paragraph>
    <Paragraph position="8"> If X has non-terminal productions, say, X ~ YZ then w might also be generated by recursive expansion of the right-hand side. Here, for each production, there are three subcases.</Paragraph>
    <Paragraph position="9">  (a) First, Y can by itself generate the complete w (see Figure l(a)).</Paragraph>
    <Paragraph position="10"> (b) Likewise, Z itself can generate w (Figure l(b)). (c) Finally, Y could generate wl ... wj as a suffix (Y ~R wl...wj) and Z, Wj+l...wn as a prefix (Z ~L  wj+l ... w,O, thereby resulting in a single occurrence of w (Figure l(c)). 3 Each of these cases will have an expectation for generating wl ... wn as a substring, and the total expectation c(w}X) will be the sum of these partial expectations. The total expectations for the first two cases (that of the substring being completely generated by Y or Z) are given recursively:</Paragraph>
    <Paragraph position="12"> where one has to sum over all possible split points j of the string w.</Paragraph>
    <Paragraph position="13"> 3We use the notation X =~R c~ to denote that non-terminal X generates the string c~ as a suffix, and X :~z c~ to denote that X generates c~ as a prefix. Thus P(X :~t. ~) and P(X ::~n o~) are the probabilities associated with those events.</Paragraph>
    <Paragraph position="14"> To compute the total expectation c(wlX), then, we have to sum over all these choices: the production used (weighted by the rule probabilities), and for each nonterminal rule the three cases above. This gives</Paragraph>
    <Paragraph position="16"> In the important special case of bigrams, this summation simplifies quite a bit, since the terminal productions are ruled out and splitting into prefix and suffix allows but one possibility: null</Paragraph>
    <Paragraph position="18"> We now have a recursive specification of the quantities c(wlX ) we need to compute. Alas, the recursion does not necessarily bottom out, since the c(wlY) and c(wlZ) quantities on the right side of equation (3) may depend themselves on c(wlX). Fortunately, the recurrence is linear, so for each string w, we can find the solution by solving the linear system formed by all equations of type (3). Notice there are exactly  as many equations as variables, equal to the number of non-terminals in the grammar. The solution of these systems is further discussed below.</Paragraph>
    <Paragraph position="19"> Computing prefix and suffix probabilities The only substantial problem left at this point is the computation of the constants in equation (3). These are derived from the rule probabilities P(X ~ w) and P(X --+ YZ), as well as the prefix/suffix generation probabilities P(Y =~R wl ... wj) and P(Z =~z wj+l ... w,~).</Paragraph>
    <Paragraph position="20"> The computation of prefix probabilities for SCFGs is generally useful for applications, and has been solved with the LRI algorithm (Jelinek and Lafferty, 1991). Recently, Stolcke (1993) has shown how to perform this computation efficiently for sparsely parameterized SCFGs using a probabilistic version of Earley's parser (Earley, 1970). Computing suffix probabilities is obviously a symmetrical task; for example, one could create a 'mirrored' SCFG (reversing the order of right-hand side symbols in all productions) and then run any prefix probability computation on that mirror grammar. null Note that in the case of bigrams, only a particularly simple form of prefix/suffix probabilities are required, namely, the 'left-corner' and 'right-corner' probabilities, P(X ~z wl) and P(Y ~ R w2), which can each be obtained from a single matrix inversion (Jelinek and Lafferty, 1991).</Paragraph>
    <Paragraph position="21"> It should be mentioned that there are some technical conditions that have to be met for a SCFG to be well-defined and consistent (Booth and Thompson, 1973). These condition are also sufficient to guarantee that the linear equations given by (3) have positive probabilities as solutions. The details of this are discussed in the Appendix.</Paragraph>
    <Paragraph position="22"> Finally, it is interesting to compare the relative ease with which one can solve the substring expectation problem to the seemingly similar problem of finding substringprobabilities: the probability that X generates (one or more instances of) w. The latter problem is studied by Corazza et al. (1991), and shown to lead to a non-linear system of equations. The crucial difference here is that expectations are additive with respect to the cases in Figure 1, whereas the corresponding probabilities are not, since the three cases can occur simultaneously. null</Paragraph>
  </Section>
  <Section position="7" start_page="76" end_page="76" type="metho">
    <SectionTitle>
EFFICIENCY AND COMPLEXITY ISSUES
</SectionTitle>
    <Paragraph position="0"> Summarizing from the previous section, we can compute any n-gram probability by solving two linear systems of equations of the form (3), one with w being the n-gram itself and one for the (n - 1)-gram prefix wl ... wn-1. The latter computation can be shared among all n-grams with the same prefix, so that essentially one system needs to be solved for each n-gram we are interested in. The good news here is that the work required is linear in the number of n-grams, and correspondingly limited if one needs probabilities for only a subset of the possible n-grams. For example, one could compute these probabilities on demand and cache the results.</Paragraph>
    <Paragraph position="1"> Let us examine these systems of equations one more time.</Paragraph>
    <Paragraph position="2"> Each can be written in matrix notation in the form</Paragraph>
    <Paragraph position="4"> where I is the identity matrix, A = (axu) is a coefficient matrix, b = (bx) is the right-hand side vector, and c represents the vector of unknowns, c(wlX ). All of these are indexed by nonterminals X, U.</Paragraph>
    <Paragraph position="5"> We get</Paragraph>
    <Paragraph position="7"> where 6(X, Y) = 1 ifX = Y, and 0 otherwise. The expression I - A arises from bringing the variables c(wlY ) and c(wlZ ) to the other side in equation (3) in order to collect the coefficients.</Paragraph>
    <Paragraph position="8"> We can see that all dependencies on the particular bigram, w, are in the right-hand side vector b, while the coefficient matrix I - A depends only on the grammar. This, together with the standard method of LU decomposition (see, e.g., Press et al. (1988)) enables us to solve for each bigram in time O(N2), rather than the standard O(N 3) for a full system (N being the number of nonterminals/variables). The LU decomposition itself is cubic, but is incurred only once. The full computation is therefore dominated by the quadratic effort of solving the system for each n-gram. Furthermore, the quadratic cost is a worst-case figure that would be incurred only if the grammar contained every possible rule; empirically we have found this computation to be linear in the number of nonterminals, for grammars that are sparse, i.e., where each nonterminal makes reference only to a bounded number of other nonterminals.</Paragraph>
  </Section>
  <Section position="8" start_page="76" end_page="77" type="metho">
    <SectionTitle>
SUMMARY
</SectionTitle>
    <Paragraph position="0"> Listed below are the steps of the complete computation. For concreteness we give the version specific to bigrams (n = 2).</Paragraph>
    <Paragraph position="1">  1. Compute the prefix (left-corner) and suffix (rightcorner) probabilities for each (nonterminal,word) pair. 2. Compute the coefficient matrix and right-hand sides for the systems of linear equations, as per equations (4) and (5).</Paragraph>
    <Paragraph position="2"> 3. LU decompose the coefficient matrix.</Paragraph>
    <Paragraph position="3"> 4. Compute the unigram expectations for each word in the grammar, by solving the LU system for the unigram right-hand sides computed in step 2.</Paragraph>
    <Paragraph position="4"> 5. Compute the bigram expectations for each word pair by solving the LU system for the bigram right-hand sides computed in step 2.</Paragraph>
    <Paragraph position="5">  . Compute each bigram probability P (w2 \]wl ), by dividing the bigram expectation c(wlw2\[S) by the unigram expectation C(Wl IS).</Paragraph>
  </Section>
class="xml-element"></Paper>