<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-1060">
  <Title>An Algorithm for Estimating the Parameters of Unrestricted Hidden Stochastic Context-Free Grammars</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
TERMINOLOGY
</SectionTitle>
    <Paragraph position="0"> The training corpus can be conwmiently segmented into sentences for puposes of training; each sentence comprisinga sequence of words. A typical one may consist ofY + 1 words, indexed from O to Y: The lookup function W(y) returns the index k of the vocabulary entry vk matching tile word w~ at position y ill tile sentence.</Paragraph>
    <Paragraph position="1"> The algorithm uses a extension of the representation and terminology used for &amp;quot;hidden Markov modeis'(hidden stochastic regular grammars) for which the Baum-Welch algorithm (Baum, 1972) is applicable (and which is also called the Forward/Backward (F/B)algo~ rithm). Grammar rules are represented as networks and illustrated graphically, maintaining a correspondence  with the trellis structure on which the computation can be conveniently repre~nted. The terminology is closely related to that of Levinson, Rabiner &amp; Sondhi (1983) and also Lari &amp; Young (1990).</Paragraph>
    <Paragraph position="2"> A set ofA f different nonterminals are represented by A; networks. A component network for the nonterminal labeled n has a parameter set (A, B, I, N,F, Top, n). To uniquely identify an element of the parameter set requires that it be a function of its nonterminal label e.g. A(n), l(n) etc.). However this notation has been topped to make formulae less cumbersome. A network labeled NP is shown in Figure 1 which represents the following rules:</Paragraph>
    <Paragraph position="4"> The rule NP =:~ Noun (0.2) means that if the NP rule is used, the probability is 0.2 that it produces a single Noun. In Figure 1, states are represented by circles with numbers inside to index them. NonierminMstates * re shown with double circles and represent references to other networks, such as ADJP. States marked with single circles ate called terminal states and represent part-of-speech categories. When a transition is made to a terminal state, a word of the current training sentence is generated. The word must have the same category as the state that generated it. Rules of the form Noun =:~ &amp;quot;cat&amp;quot; (0.002) and Noun ==~ &amp;quot;dog&amp;quot; (0.001) are collapsed into a state-dependent probability vector b(j), Network NP</Paragraph>
    [Figure 1: Network NP, showing Det, ADJP, and Noun states with their transition probabilities and final states marked F]
    <Paragraph position="0"> which is an element of the output matrix B. Elements of the vector such as b(j W(y))represent the probability of seeing word wy in terminal state j. A transition to a nonterminal state does not in itself generate any words, however terminal states within the referenced network will do so. The parameter N is a matrix which indicates the label (e.g. n, NP, ADJP) of the network that a nonterminal state refers to. The probability of making a transition from state i to state j is labeled a(i, j) and collectively these probabilities form the transition matrix A. The initial matrix I contains the production probabilities for rules that are modelled by the network. They are indicated in Figure 1 as numbers beneath the state, if they are non-zero, l(i) can be equivalently viewed as the probability that some sub-sequence of n is started at state i. The parameter F is the set of final states; any sequence accepted by the network must terminate on a final state. In Figure 1 final states are designated with the annotation &amp;quot;F'. The boolean value Top indicates whether the network is the top-level network. Only one network may be assigned as the top-level network, which models productions in-ACT~ DE COLING-92, N^~T.S, 23-28 Aot~r 1992 3 8 8 Paoc. oF COLING-92, NA~rrEs, AuG. 23-28, 1992 volving the root symbol of a grammar.</Paragraph>
    <Paragraph position="1"> An equivalent network for the same set of rules is shown in Figure 2. The lexical rules can be written compactly as networks, with fewer states. The transitions from the determiner state each have probability 0.5 (i.e a(1, 2) : a(1,3) = 0.5). It should be noted that the algorithm can operate on either network.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
TRELLIS DIAGRAM
</SectionTitle>
    <Paragraph position="0"> &amp;quot;lYellis dia~rsans conveniently relate computational quantities to the network structure and a training sentence. Each network u has a set of Y + 1 trellises for subsequences of a sentence Wo...wy, starting at each different position and ending at subsequent ones. A single trellis spanning positions 0...2 is shown ill Figure 4 for network NP. Nonterminal states are associated with a row of start nodes indicating where daughter constituents may start, and a row of end nodes that indicate where they end. A pair of start/end nodes thus refer to a daughter nonterminal constituent. In Figure 4, the ADJP network is referenced via the start state at position O. An adjective is then generated by a terminal state in the trellis for the ADJP network, followed by a transition and another adjective. The ADJP network is left at position 1, and a transition is made to the noun state where the word %at&amp;quot; is generated. Terminal states are associated with a single row of nodes in the trellis (they represent terminal productions that span only a single position). The path taken through the trellis is shown with broken a line.</Paragraph>
    <Paragraph position="1"> A path through different trellises has a corresponding unique tree representation, as exemplified in Figure 5.</Paragraph>
    <Paragraph position="2"> In cases where para~ are ambiguous, several paths exist corresponding to the alternative derivations. We shall next consider the computation of the probabilities of the paths. Two basic quantities are involved, namely alpha and beta probabilities. Loosely speaking, the alphas represent probabilities of subtrees associated with nonterminals, while the betas refer to the rest of the tree structure external to the subtrees. Subsequently, products of these quantities will be formed, which represent the probabilities of productions being used in generating a sentence. These are summed over all sentences in the training corpus to obtain the expected number of times each production is used, based on the current production probabilities and the training corpus. These are used like frequency counts would be for a parsed corpus, to form ratios that represent new estimates of the production probabilities. The procedure is iterated several times until the estimates do not change between iterations (i.e. the overall likelihood of producing the training corpus no longer increases).</Paragraph>
    <Paragraph position="3"> The algorithm makes use of one set of trellis diagrams to compute alpha probabilities, and another for beta probabilities. These are both split into terminal, nonterminal-start and nontermiual-end probabilities, corresponding to the three different types of nodes in the trellis diagram. The alpha set are labeled at, c~,~t, and ante respectively. The algorithm was originally formulated using solely the trellis representation (Kupiec, 1991) however the definitions that follow will  ACRES DE COLING-92, NANTES, 23-28 Ao~r 1992 3 8 9 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 also be related to the consituent structures used in the equivalent parse trees. In the following equations, three sets will be mentioned: 1. Term(n) The set of terminal states in network n.</Paragraph>
    <Paragraph position="4"> 2. Nonterm(n) This is the set of nonterminal states in network n.</Paragraph>
    <Paragraph position="5"> 3. Final(n) The set F of final states in network n.</Paragraph>
    <Paragraph position="6"> at(z, y, j, n): The probability that network n generates the words w,...w~ inclusive and is at the node for terminal state j at position y.</Paragraph>
    <Paragraph position="7"> ~,(~, v, J, n) =</Paragraph>
    <Paragraph position="9"> and whose next extension will involve trees dominated by N(p, n), the nonterminal referred to by state p.</Paragraph>
    <Paragraph position="10"> elate(z, y, p, n): The probability that network n generates the words w~:...wy inclusive, and is at the end node of nonterminal state p at position y.</Paragraph>
    <Paragraph position="12"> crnte(x,y,p, n) represents the probability of a con(l) stituent for n that spans x...y, formed by extending the various constituents ctnts(x,v,p,n) (ending at v - 1) with corresponding completed constituents starting at (2) v, ending at y and dominated by N(p, n).</Paragraph>
    <Paragraph position="13"> at(::, y, j, n) represents a constituent for nontermihal n spanning positions x...y. It is formed by extending an incomplete constituent for n, by addition of the terminal w v at state j. The two terms indicate cases where the constituent previously ended on either a terminal or another constituent completed at y - 1 (as in Figure 5, where the complete ADJP constituent is followed by the noun &amp;quot;eat&amp;quot;). If j is a final state the extended constituent is complete.</Paragraph>
    <Paragraph position="14"> antJ (z, y, p, n): The probability that network n generates the words wr...wv_l inclusive, and is at the start node of nonterminal state p at pc~ition y.</Paragraph>
    <Paragraph position="16"> ant, (x, y, p, n) represents an incomplete constituent for nonterminal n whose left subtrees span z...y- 1, The quantity Oqot,a(v, y, n) refers to the probability that network n generates the words w~...w~ inclusive and is in a final state of n at position y. Equivalently it is the probability that nonterminal n dominates all derivations that span positions v...y. The Cttotat probabilities correspond to the &amp;quot;Inner&amp;quot; (bottom-up) probabilities of the I/O algorithm. If a network corresponding to Chomsky normal form is substituted in equation (6), the reeursion for the inner probabilities of the I/O algorithm will be produced after further substitutions using equations (1)-(6).</Paragraph>
    <Paragraph position="17"> In the previous equations (5) and (6) it can be seen that the a,,~, probabilities for a network are defined recursively. They will never be self-referential if the grammar is cycle-free, (i.e. there are no derivations A =:~ A for any nonterminal production A). Although only cycle-free grammars are of interest here, it is worth mention that if cycles do exist (with associated probabilities less than unity), the recursions form a geometric series which has a finite sum.</Paragraph>
    <Paragraph position="18"> The alpha probabilities are all computed first because the beta probabilities make use of them. The latter are defined recursively in terms of trees that are external to a given constituent, and as a result the recursions are less obvious than those for the alpha probabilities. The basic recursion rests on the quantity '6,,~ which involves the following functions .6above and fl, la,:</Paragraph>
    <Paragraph position="20"> Given a constituent n spanning x...y, fla6ove(~, Y, n) indicates how the constituents spanning v...y mid labeled m that immediately dominate n relate to the constituents that are in turn external to m via flute(v, y, r,m). This situation is shown in Figure 6, where for simplicity ~nte(v,y, r, m) has not been graphically indicated.</Paragraph>
    <Paragraph position="21"> Note that m can dominate x...y as well as left subtrees.</Paragraph>
    <Paragraph position="22"> fl.ide(X, y, l, n) defines another reeursion for a constituent labeled n that spans x...y, and is in state I at time y. The recursion relates the addition of right sub-trees (spanning y+ 1...w) to the remaining external tree, via flnt~(x, w, q, n). This is depicted in Figure 7 (again the external trees repret~nted by time(x, w, q, n) are not shown). Also omitted from the figure is the first term of Equation 8 which relates the addition of a single terminal at position y + 1 to an external tree defined by fit(x, y + 1,i, n). fir and the various other probabilities for a beta trellis are defined next: fit (x, y, j, n): The probability of generating the prefix wo...w~-i and suffix w~+l...wY given that network n generated wz...wy and is in terminal state j at position y. The indicator function Ind 0 is used in subsequent equations. Its value is unity when its argument is true and zero otherwise.</Paragraph>
    <Paragraph position="24"/>
    <Paragraph position="26"> Tile first term in Equation 9 describes tile relationship of the tree external to x...y + 1 to tile tree external to x...y, with r~pect to state j generating the terminal wv4. l at time y + 1. If the constituent n spamfing x...y is complete, the second term describes the probability of the external tree via coastituents that immediately dominate n.</Paragraph>
    <Paragraph position="27"> AcrEs DE COLING-92, NANlXS, 23-28 AoPSrr 1992 3 9 1 PRoc. oF COL1NG-92, NANTES. AUG. 23-28, 1992 flnte(x, y,p, n): The probability of generating the prefix wo...w,,_~ and suffix w~+x...wy given that network n generated wr...w~ andthe end node of state p is reached at position y.</Paragraph>
    <Paragraph position="28"> flnt.(x, y, p, n) = fl.ia.(z, y, p, n)</Paragraph>
    <Paragraph position="30"> mula for fit, but is used with nonterminal states. Via fla~,ov~ mad/3slae it relates the tree external to the constituent n (spanning x...y) to the trees external to v...y and z...w. During the recursion, the trees shown in Figures 6 and 7 are substituted into each other (at the positions shown with shaded areas). Thus the external trees are successively extended to the left and right, until the root of the outerm~t tree is reached. It can be seen that the values for j3nt~(x,y,p, n) are defined in terms of those in other networks which reference n via /3~bo~e. As a result this computation has a top-down order. In contrast, the cunte(z,y,p , n) probabilities involve other networks that are referred to by network n and so assigned in a bottom-up order. If the network topology for Chomsky normal form is substituted in equation (12), the reeur~ion for the &amp;quot;Outer&amp;quot; probabilities of the I/O algorithm can be derived after further substitutions. The ~ntt probabilities for final states then correspond to the outer probabilities.</Paragraph>
    <Paragraph position="31"> 13,to(X,y,p, n): The probability of generating the prefix wo...w~-i and suffix w~...wy given that network n generated w~...ww_ x and is at the start node of state p at position y.</Paragraph>
    <Paragraph position="33"/>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
RE-ESTIMATION FORMULAE
</SectionTitle>
    <Paragraph position="0"> Once the alpha and beta probabilities are available, it is a straightforward matter to obtain new parameter estimates (A, B, I). The total probability P of a sentence in found from the top-level network nTop.</Paragraph>
    <Paragraph position="2"> There are four different kinds of transition:  1. Terminalnode i to terminal node j. 2. Terminal node i to nonlerminal start node p. 3. Non~erminal end node p to nonierminal start q. 4. Nonterminal end node p to terminal node i.  The expected total number of times a transition is made from state i to state j conditioned on the observed sentence is E(C/ij). The following formulae give E(C/) for each of the above eases:</Paragraph>
    <Paragraph position="4"> A new estimate 5(i, j) for a typical transition is then:</Paragraph>
    <Paragraph position="6"> Only B matrix elements for terminal states are used, and are re-estimated as follows. The expected total number of times the k'th vocabulary entry vk is generated in state i conditioned on the observed sentence is E(yl,k). A new estimate for b(i, k) can then be found:</Paragraph>
    <Paragraph position="8"/>
  </Section>
class="xml-element"></Paper>