<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1508"> <Title>Stochastic Multiple Context-Free Grammar for RNA Pseudoknot Modeling</Title> <Section position="4" start_page="57" end_page="58" type="metho"> <SectionTitle> 2 Stochastic Multiple Context-Free Grammar </SectionTitle> <Paragraph position="0"> A stochastic multiple context-free grammar (stochastic MCFG, or SMCFG) is a probabilistic extension of MCFG (Kasami et al., 1988; Seki et al., 1991), also known as a linear context-free rewriting system (Vijay-Shanker et al., 1987). An SMCFG is a 5-tuple G = (N, T, F, P, S), where N is a finite set of nonterminals, T is a finite set of terminals, F is a finite set of functions, P is a finite set of (production) rules, and S ∈ N is the start symbol. For each A ∈ N, a positive integer dim(A) is given, and A derives dim(A)-tuples of terminal sequences. For the start symbol S, dim(S) = 1.</Paragraph> <Paragraph position="1"> For each f ∈ F, positive integers di (0 ≤ i ≤ k) are given, and f is a total function from (T*)^d1 × ··· × (T*)^dk to (T*)^d0, where each component of f is defined as the concatenation of some components of its arguments and constant sequences. Note that each component of an argument may occur in the function value at most once (linearity). For example, f[(x11, x12), (x21, x22)] = (x11x21, x12x22). Each rule in P has the form A0 →p f[A1, ..., Ak], where Ai ∈ N (0 ≤ i ≤ k), f ∈ F, and p is a real number with 0 ≤ p ≤ 1 called the probability of the rule.</Paragraph> <Paragraph position="3"> The probabilities of the rules with the same left-hand side must sum to one. If we are not interested in p, we simply write A0 → f[A1, ..., Ak].</Paragraph> <Paragraph position="4"> If k ≥ 1, the rule is called a nonterminating rule; if k = 0, it is called a terminating rule. A terminating rule A0 → f[ ] with f^(h)[ ] = bh (1 ≤ h ≤
dim(A0)) is simply written as A0 → (b1, ..., b_dim(A0)).</Paragraph> <Paragraph position="5"> We recursively define the relation ⇒ by the following (L1) and (L2): (L1) if A →p α ∈ P (α ∈ (T*)^dim(A)), then we write A ⇒ α with probability p, and (L2) if A →p f[A1, ..., Ak] ∈ P and Ai ⇒ αi ∈ (T*)^dim(Ai) (1 ≤ i ≤ k) with probabilities p1, ..., pk, respectively, then we write A ⇒ f[α1, ..., αk] with probability p · ∏_{i=1}^{k} pi. In parallel with the relation ⇒, we define derivation trees as follows: (D1) if A →p α ∈ P (α ∈ (T*)^dim(A)), then the ordered tree whose root is labeled A and has α as its only child is a derivation tree for α with probability p, and (D2) if A →p f[A1, ..., Ak] ∈ P, Ai ⇒ αi ∈ (T*)^dim(Ai) (1 ≤ i ≤ k), and t1, ..., tk are derivation trees for α1, ..., αk with probabilities p1, ..., pk, respectively, then the ordered tree whose root is labeled A (or A : f if necessary) and has t1, ..., tk as (immediate) subtrees from left to right is a derivation tree for f[α1, ..., αk] with probability p · ∏_{i=1}^{k} pi. Example rules are A →0.3 f[A], where f[(x1, x2)] = (a x1 b, c x2 d), and A →0.7 (ab, cd). Then A ⇒ (ab, cd) by the second rule, followed by A ⇒ f[(ab, cd)] = (aabb, ccdd) by the first rule. The probability of the latter derivation is 0.3 × 0.7 = 0.21. The language generated by an SMCFG G is defined as L(G) = {w ∈ T* | S ⇒ w with probability greater than 0}.</Paragraph> <Paragraph position="8"> In this paper, we focus on an SMCFG Gs = (N, T, F, P, S) that satisfies the following conditions: Gs has m distinct nonterminals W1, ..., Wm, each of which uses only one type of rule, denoted E, S, D, B1, B2, B3, B4, U1L, U1R, U2L, U2R or P (see Table 1). The type of Wv is denoted by type(v), and we predefine type(1) = S, that is, W1 is the start symbol. Consider a sample rule set Wv → U^a_1L[Wy] | U^a_1L[Wz], where U^a_1L[(x1, x2)] = (a x1, x2) and a ∈ T.
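As an aside, the two-rule example above (A →0.3 f[A] with f[(x1, x2)] = (a x1 b, c x2 d), and A →0.7 (ab, cd)) can be replayed in a small Python sketch; the function and variable names here are illustrative, not from the paper.

```python
# Illustrative sketch (names are mine): replay the example SMCFG derivation
# A =0.3=> f[A] with f[(x1, x2)] = (a·x1·b, c·x2·d), and A =0.7=> (ab, cd).

def f(arg):
    # Each component of the argument is used exactly once (linearity).
    x1, x2 = arg
    return ("a" + x1 + "b", "c" + x2 + "d")

# Terminating rule: A derives the tuple (ab, cd) with probability 0.7.
tup, prob = ("ab", "cd"), 0.7

# Nonterminating rule: A derives f[(ab, cd)] with rule probability 0.3;
# the derivation probability is the product of the rule probabilities used.
tup, prob = f(tup), 0.3 * prob

# tup is now ('aabb', 'ccdd'), and prob is 0.3 * 0.7 = 0.21.
```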
For each rule r, two real values, called the transition probability p1 and the emission probability p2, are specified in Table 1. The probability of r is simply defined as p1 · p2. In applications, p1 = tv(y), p2 = ev(ai), ... in Table 1 are the parameters of the grammar, which are set by hand or by a training algorithm (Section 3.3), depending on the set of possible sequences to be analyzed.</Paragraph> </Section> <Section position="5" start_page="58" end_page="61" type="metho"> <SectionTitle> 3 Algorithms for SMCFG </SectionTitle> <Paragraph position="0"> In RNA structure analysis using stochastic grammars, we have to deal with the following three problems: (1) calculate the optimal alignment of a sequence to a stochastic grammar (alignment problem), (2) calculate the probability of a sequence given a stochastic grammar (scoring problem), and (3) estimate optimal probability parameters for a stochastic grammar given a set of example sequences (training problem). In this section, we give solutions to each problem for the specific SMCFG Gs = (N, T, F, P, S).</Paragraph> <Section position="1" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 3.1 Alignment Problem </SectionTitle> <Paragraph position="0"> The alignment problem for Gs is to find the most probable derivation tree for a given input sequence. (The rule types of Section 2 stand for END, START, DELETE, BIFURCATION, UNPAIR and PAIR, respectively.)</Paragraph> <Paragraph position="1"> This problem can be solved by a dynamic programming algorithm similar to the CYK algorithm for SCFGs (Durbin et al., 1998), and in this paper we also call the parsing algorithm for Gs the CYK algorithm. We fix an input sequence w = a1···an (|w| = n). Let γv(i, j) and γy(i, j, k, l) be the logarithms of the maximum probabilities of a derivation subtree rooted at a nonterminal Wv for a terminal subsequence ai···aj, and of a derivation subtree rooted at a nonterminal Wy for a tuple of terminal subsequences (ai···aj, ak···al), respectively.
The variables γv(i, i−1) and γy(i, i−1, j, j−1) are the logarithms of the maximum probabilities for an empty sequence ε and a pair of ε's. Let τv(i, j) and τy(i, j, k, l) be traceback variables for constructing a derivation tree, which are calculated together with γv(i, j) and γy(i, j, k, l). We define Cv = {y | Wv → f[Wy] ∈ P, f ∈ F}.</Paragraph> <Paragraph position="2"> To avoid non-emitting cycles, we assume that the nonterminals are numbered such that v < y for all y ∈ Cv. The CYK algorithm uses a five-dimensional dynamic programming matrix to calculate γ, which leads to log P(w, π̂ | θ), where π̂ is the most probable derivation tree and θ is the entire set of probability parameters. The detailed description of the CYK algorithm is as follows: Algorithm 1 (CYK).</Paragraph> <Paragraph position="3"> Initialization: for i ← 1 to n + 1, j ← i to n + 1, v ← 1 to m. [The initialization and recursion equations are omitted here; in them, ∆1Lv, ..., ∆2Rv are set to 0 for the other types except P.]</Paragraph> <Paragraph position="8"> When the calculation terminates, we obtain log P(w, π̂ | θ) = γ1(1, n). If there are b BIFURCATION nonterminals and a nonterminals of the other types, the time and space complexities of the CYK algorithm are O(amn^4 + bn^5) and O(mn^4), respectively. To recover the optimal derivation tree, we use the traceback variables τ. Due to space limitations, the full description of the traceback algorithm is omitted (see (Kato and Seki, 2006)).</Paragraph> </Section> <Section position="2" start_page="58" end_page="60" type="sub_section"> <SectionTitle> 3.2 Scoring Problem </SectionTitle> <Paragraph position="0"> As in SCFGs (Durbin et al., 1998), the scoring problem for Gs can be solved by the inside algorithm.
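Since the recursion equations of Algorithm 1 are not reproduced above, the underlying idea can be illustrated on the simpler SCFG case of Durbin et al. (1998): a max-product dynamic program over spans, computed in log space. The toy grammar and names below are hypothetical, not the paper's Gs; the full SMCFG version additionally tracks a second string component per nonterminal via the five-dimensional matrix.

```python
import math

# Hedged sketch: CYK for a toy SCFG in Chomsky normal form (hypothetical
# grammar, not the paper's Gs).  Binary rules A0 -> A1 A2 and lexical
# rules A0 -> terminal, each with a probability; probabilities of rules
# sharing a left-hand side sum to one.
binary = {("S", ("A", "B")): 0.9, ("S", ("S", "S")): 0.1}
lexical = {("A", "a"): 1.0, ("B", "b"): 1.0}
nonterms = ["S", "A", "B"]

def cyk_log(w):
    """Log of the maximum derivation probability of w from S (gamma-style DP)."""
    n = len(w)
    g = {v: [[-math.inf] * n for _ in range(n)] for v in nonterms}
    for i, ch in enumerate(w):                       # initialization: length-1 spans
        for (v, sym), p in lexical.items():
            if sym == ch:
                g[v][i][i] = max(g[v][i][i], math.log(p))
    for span in range(2, n + 1):                     # recursion over longer spans
        for i in range(n - span + 1):
            j = i + span - 1
            for (v, (y, z)), p in binary.items():
                for k in range(i, j):                # bifurcation point
                    g[v][i][j] = max(g[v][i][j],
                                     math.log(p) + g[y][i][k] + g[z][k + 1][j])
    return g["S"][0][n - 1]
```

A traceback for the optimal tree would store, alongside each max, which rule and split point achieved it, mirroring the τ variables in the text.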
The inside algorithm calculates the summed probabilities αv(i, j) and αy(i, j, k, l) of all derivation subtrees rooted at a nonterminal Wv for a subsequence ai···aj, and of all derivation subtrees rooted at a nonterminal Wy for a tuple of subsequences (ai···aj, ak···al), respectively.</Paragraph> <Paragraph position="1"> The variables αv(i, i−1) and αy(i, i−1, j, j−1) are defined for empty sequences in a similar way to the CYK algorithm. Therefore, we can obtain the inside algorithm simply by replacing the max operations in the CYK algorithm with summations.</Paragraph> <Paragraph position="2"> When the calculation terminates, we obtain the probability P(w | θ) = α1(1, n). The time and space complexities of the algorithm are identical to those of the CYK algorithm.</Paragraph> <Paragraph position="3"> In order to re-estimate the probability parameters of Gs, we need the outside algorithm. The outside algorithm calculates the summed probability βv(i, j) of all derivation trees excluding subtrees rooted at a nonterminal Wv generating a subsequence ai···aj. Also, it calculates βy(i, j, k, l), the summed probability of all derivation trees excluding subtrees rooted at a nonterminal Wy generating a tuple of subsequences (ai···aj, ak···al). In the algorithm, we will use the inside variables computed by the inside algorithm.</Paragraph> <Paragraph position="5"> Note that calculating the outside variables β requires the inside variables α. Unlike the CYK and inside algorithms, the outside algorithm recursively works its way inward. The time and space complexities of the outside algorithm are the same as those of the CYK and inside algorithms.
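As a minimal illustration of the max-to-sum change, again in the plain SCFG setting rather than the paper's Gs (the toy grammar below is hypothetical), the inside algorithm sums the probabilities of all derivations instead of maximizing over them:

```python
# Hedged sketch: inside algorithm for a toy CNF SCFG (hypothetical grammar,
# not the paper's Gs).  Same span-based dynamic program as CYK, but with
# summation in place of max, yielding P(w | theta) = alpha_S(0, n-1).
binary = {("S", ("A", "B")): 0.9, ("S", ("S", "S")): 0.1}
lexical = {("A", "a"): 1.0, ("B", "b"): 1.0}
nonterms = ["S", "A", "B"]

def inside(w):
    """Summed probability of all derivations of w from S."""
    n = len(w)
    a = {v: [[0.0] * n for _ in range(n)] for v in nonterms}  # inside variables
    for i, ch in enumerate(w):
        for (v, sym), p in lexical.items():
            if sym == ch:
                a[v][i][i] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (v, (y, z)), p in binary.items():
                a[v][i][j] += p * sum(a[y][i][k] * a[z][k + 1][j]
                                      for k in range(i, j))
    return a["S"][0][n - 1]
```

The outside pass would work inward from the full span, combining these inside values with rule probabilities, which is why β cannot be computed without α.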
Formally, the outside algorithm is as follows. [The algorithm listing is omitted here.]</Paragraph> </Section> <Section position="3" start_page="60" end_page="61" type="sub_section"> <SectionTitle> 3.3 Training Problem </SectionTitle> <Paragraph position="0"> The training problem for Gs can be solved by the EM algorithm called the inside-outside algorithm, in which the inside variables α and outside variables β are used to re-estimate the probability parameters.</Paragraph> <Paragraph position="1"> First, we consider the probability that a nonterminal Wv is used at positions i, j, k and l in a derivation of a single sequence w. If type(v) = S, the probability is (1/P(w | θ)) αv(i, j) βv(i, j); otherwise it is (1/P(w | θ)) αv(i, j, k, l) βv(i, j, k, l). By summing these over all positions in the sequence, we can obtain the expected number of times that Wv is used for w as follows: for type(v) = S, the expected count is (1/P(w | θ)) Σ_{i,j} αv(i, j) βv(i, j); otherwise it is (1/P(w | θ)) Σ_{i,j,k,l} αv(i, j, k, l) βv(i, j, k, l).</Paragraph> <Paragraph position="4"> Next, we extend these expected values from a single sequence w to multiple independent sequences w(r) (1 ≤ r ≤ N). Let α(r) and β(r) be the inside and outside variables calculated for each input sequence w(r). Then we can obtain the expected number of times that a nonterminal Wv is used for the training sequences w(r) (1 ≤ r ≤ N) by summing the above terms over all sequences: for type(v) = S, the count is Σ_{r=1}^{N} (1/P(w(r) | θ)) Σ_{i,j} α(r)v(i, j) β(r)v(i, j); otherwise it is Σ_{r=1}^{N} (1/P(w(r) | θ)) Σ_{i,j,k,l} α(r)v(i, j, k, l) β(r)v(i, j, k, l).</Paragraph> <Paragraph position="7"> Similarly, for a given Wy, the expected number of times that a rule Wv → f[Wy] is applied can be obtained by combining, at the corresponding positions, the outside variable of Wv, the probability of the rule, and the inside variable of Wy. [The type-by-type equations are omitted here.] For a given terminal a or a pair of terminals (a, b), the expected number of times that a rule containing a (or a and b) is applied is obtained analogously, using δ(C), which is 1 if the condition C in parentheses is true, and 0 if C is false. [These equations are also omitted here.]</Paragraph> <Paragraph position="12"> Now, we re-estimate the probability parameters using the above expected counts.
Let t̂v(y) be the re-estimated transition probability from Wv to Wy.</Paragraph> <Paragraph position="13"> Also, let êv(a) and êv(a, b) be the re-estimated emission probabilities that Wv emits a symbol a and a pair of symbols a and b, respectively. Each re-estimated probability is obtained by dividing the corresponding expected rule count by the expected number of times that Wv is used (equations (3.1), omitted here).</Paragraph> <Paragraph position="15"> Note that the expected count corresponding to the correct nonterminal type must be substituted into these equations. In summary, the inside-outside algorithm is as follows: Algorithm 3 (Inside-Outside).</Paragraph> <Paragraph position="16"> Initialization: Pick arbitrary probability parameters of the model.</Paragraph> <Paragraph position="17"> Iteration: Calculate the new probability parameters using (3.1). Calculate the new log likelihood Σr log P(w(r) | θ).</Paragraph> <Paragraph position="19"> Termination: Stop if the change in log likelihood is less than a predefined threshold.</Paragraph> </Section> </Section> </Paper>