File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-1009_metho.xml
Size: 18,337 bytes
Last Modified: 2025-10-06 14:07:08
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1009"> <Title>COMBINATION OF N-GRAMS AND STOCHASTIC CONTEXT-FREE GRAMMARS FOR LANGUAGE MODELING*</Title> <Section position="3" start_page="0" end_page="55" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Language modeling is an important aspect to consider in large-vocabulary speech recognition systems (Bahl et al., 1983; Jelinek, 1998). The n-gram models are the most widely used for a wide range of domains (Bahl et al., 1983). The n-grams are simple and robust models and adequately capture the local restrictions between words. Moreover, it is well known how to estimate the parameters of the model and how to integrate them into a speech recognition system. However, the n-gram models cannot adequately characterize the long-term constraints of the sentences of the tasks.</Paragraph> <Paragraph position="1"> On the other hand, Stochastic Context-Free Grammars (SCFGs) allow a better modeling of long-term relations and work well on limited-domain tasks of low perplexity. (* This work has been partially supported by the Spanish CICYT under contract TIC98/0423-C06.)</Paragraph> <Paragraph position="2"> However, SCFGs work poorly for large-vocabulary, general-purpose tasks, because learning SCFGs and the computation of word transition probabilities present serious problems for complex real tasks.</Paragraph> <Paragraph position="3"> In the literature, a number of works have proposed ways to generalize the n-gram models (Jelinek, 1998; Siu and Ostendorf, 2000) or to combine them with other structural models (Bellegarda, 1998; Gillett and Ward, 1998; Chelba and Jelinek, 1998).</Paragraph> <Paragraph position="4"> In this paper, we present a combined language model defined as a linear combination of n-grams, which are used to capture the local relations between words, and a stochastic grammatical model, which is used to represent the global relations between syntactic structures. In order to capture these long-term relations and to solve the main problems derived from large-vocabulary complex tasks, we propose here to define a category-based SCFG and a probabilistic model of word distribution in the categories. Taking into account this proposal, we also describe here how to solve the learning of these stochastic models and their integration problems.</Paragraph> <Paragraph position="5"> With regard to the learning problem, several algorithms that learn SCFGs by means of estimation algorithms have been proposed (Lari and Young, 1990; Pereira and Schabes, 1992; Sánchez and Benedí, 1998), and promising results have been achieved with category-based SCFGs on real tasks (Sánchez and Benedí, 1999).</Paragraph> <Paragraph position="6"> In relation to the integration problem, we present two algorithms that compute the word transition probability: the first algorithm is based on the Left-to-Right Inside algorithm (LRI) (Jelinek and Lafferty, 1991), and the second is based on an application of a Viterbi scheme to the LRI algorithm (the VLRI algorithm) (Sánchez and Benedí, 1997).</Paragraph> <Paragraph position="7"> Finally, in order to evaluate the behavior of this proposal, experiments with a part of the Wall Street Journal processed in the Penn Treebank project were carried out, and significant improvements with regard to the classical n-gram models were achieved.</Paragraph> </Section> <Section position="4" start_page="55" end_page="55" type="metho"> <SectionTitle> 2 The language model </SectionTitle> <Paragraph position="0"> An important problem related to language modeling is the evaluation of Pr(wk | w1 ... wk-1).</Paragraph> <Paragraph position="1"> In order to compute this probability, we propose a hybrid language model defined as a simple linear combination of n-gram models and a stochastic grammatical model Gs:</Paragraph> <Paragraph position="2"> Pr(wk | w1 ... wk-1) = α Pr(wk | wk-n ... wk-1) + (1 - α) Pr(wk | w1 ... wk-1, Gs)   (1)</Paragraph> <Paragraph position="3"> where 0 < α < 1 is a weight factor which depends on the task.</Paragraph> <Paragraph position="4"> The expression Pr(wk | wk-n ... wk-1) is the probability of occurrence of the word wk given by the n-gram model. The parameters of this model can be easily estimated, and the expression Pr(wk | wk-n ... wk-1) can be efficiently computed (Bahl et al., 1983; Jelinek, 1998).</Paragraph>
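As an illustration of expression (1), the following minimal sketch (in Python) shows the linear combination itself. It assumes the two component probabilities have already been produced by an n-gram model and by the grammatical model; the function name and arguments are illustrative and not part of the original paper.

```python
def combined_word_probability(ngram_prob, grammar_prob, alpha):
    """Expression (1): linear combination of the n-gram probability
    Pr(wk | wk-n ... wk-1) and the grammatical probability
    Pr(wk | w1 ... wk-1, Gs), weighted by the task-dependent factor alpha."""
    assert 0.0 < alpha < 1.0  # 0 < alpha < 1, as stated in the text
    return alpha * ngram_prob + (1.0 - alpha) * grammar_prob
```

With alpha close to 0.5 both components contribute about equally, which is in line with the weight of the grammatical part reported in Section 5.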
<Paragraph position="5"> In order to define the stochastic grammatical model Gs of expression (1) for complex tasks, we propose a combination of two different stochastic models: a category-based SCFG (Gc), which allows us to represent the long-term relations between syntactic structures, and a probabilistic model of word distribution into categories (Cw).</Paragraph> <Paragraph position="6"> This proposal introduces two important aspects, which are the estimation of the parameters of the stochastic models, Gc and Cw, and the computation of the following expression:</Paragraph> <Paragraph position="7"> Pr(wk | w1 ... wk-1, Gs) = Pr(w1 ... wk ... | Gc, Cw) / Pr(w1 ... wk-1 ... | Gc, Cw)   (2)</Paragraph> </Section> <Section position="5" start_page="55" end_page="56" type="metho"> <SectionTitle> 3 Training of the models </SectionTitle> <Paragraph position="0"> The parameters of the described model are estimated from a training sample, that is, from a set of sentences. Each word of a sentence has a part-of-speech tag (POStag) associated with it. These POStags are considered as word categories and are the terminal symbols of the SCFG. From this training sample, the parameters of Gc and Cw can be estimated as follows. First, the parameters of Cw, represented by Pr(w|c), are computed as: Pr(w|c) = N(w, c) / Σw' N(w', c)   (3), where N(w, c) is the number of times that the word w has been labeled with the POStag c. It is important to note that a word w can belong to different categories. In addition, it may happen that a word in a test set does not appear in the training set, and therefore some smoothing technique has to be carried out.</Paragraph>
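A minimal sketch of how expression (3) can be estimated from a POS-tagged sample is shown below. The data layout (sentences as lists of (word, POStag) pairs), the function name, and the small constant returned for unseen word/category pairs are assumptions made for illustration; the smoothing actually used by the authors is only described later, in Section 5.

```python
from collections import defaultdict

def estimate_word_given_category(tagged_sentences, unseen_prob=1e-7):
    """Relative-frequency estimate of Pr(w | c) = N(w, c) / sum_w' N(w', c),
    where N(w, c) counts how many times word w was labeled with POStag c."""
    pair_counts = defaultdict(int)   # N(w, c)
    cat_counts = defaultdict(int)    # sum over words of N(w, c)
    for sentence in tagged_sentences:
        for word, cat in sentence:
            pair_counts[(word, cat)] += 1
            cat_counts[cat] += 1

    def word_given_cat(word, cat):
        # A word may belong to several categories; unseen pairs get a small
        # probability so that every category can generate an unseen word.
        if (word, cat) in pair_counts:
            return pair_counts[(word, cat)] / cat_counts[cat]
        return unseen_prob

    return word_given_cat
```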
<Paragraph position="1"> With regard to the estimation of the category-based SCFGs, one of the most widely known methods is the Inside-Outside (IO) algorithm (Lari and Young, 1990). The application of this algorithm presents important problems which are accentuated in real tasks: the time complexity per iteration and the large number of iterations that are necessary to converge. An alternative to the IO algorithm is an algorithm based on the Viterbi score (VS algorithm) (Ney, 1992). The convergence of the VS algorithm is faster than that of the IO algorithm. However, the SCFGs obtained are, in general, not as well learned (Sánchez et al., 1996).</Paragraph> <Paragraph position="2"> Another possibility for estimating SCFGs, which is somewhere between the IO and VS algorithms, has recently been proposed. This approach considers only a certain subset of derivations in the estimation process. In order to select this subset of derivations, two alternatives have been considered: from the structural information contained in a bracketed corpus (Pereira and Schabes, 1992; Amaya et al., 1999), and from the statistical information contained in the k-best derivations (Sánchez and Benedí, 1998).</Paragraph> <Paragraph position="3"> In the first alternative, the IOb and VSb algorithms, which learn SCFGs from partially bracketed corpora, were defined (Pereira and Schabes, 1992; Amaya et al., 1999). In the second alternative, the kVS algorithm for the estimation of the probability distributions of a SCFG from the k-best derivations was proposed (Sánchez and Benedí, 1998).</Paragraph> <Paragraph position="4"> All of these algorithms have a time complexity O(n³|P|), where n is the length of the input string and |P| is the size of the SCFG.</Paragraph> <Paragraph position="5"> These algorithms have been tested in real tasks for estimating category-based SCFGs (Sánchez and Benedí, 1999), and the results obtained justify their application in complex real tasks.</Paragraph> </Section> <Section position="6" start_page="56" end_page="57" type="metho"> <SectionTitle> 4 Integration of the model </SectionTitle> <Paragraph position="0"> From expression (2), it can be seen that in order to integrate the model, it is necessary to efficiently compute the expression: Pr(w1 ... wk ... | Gc, Cw).   (4) In order to describe how this computation can be made, we first introduce some notation.</Paragraph> <Paragraph position="1"> A Context-Free Grammar G is a four-tuple (N, Σ, P, S), where N is the finite set of nonterminals, Σ is the finite set of terminals (N ∩ Σ = ∅), S ∈ N is the axiom or initial symbol, and P is the finite set of productions or rules of the form A → α, where A ∈ N and α ∈ (N ∪ Σ)+ (only grammars with non-empty rules are considered). For simplicity (but without loss of generality), only context-free grammars in Chomsky Normal Form are considered, that is, grammars with rules of the form A → BC or A → v, where A, B, C ∈ N and v ∈ Σ.</Paragraph> <Paragraph position="2"> A Stochastic Context-Free Grammar Gs is a pair (G, p), where G is a context-free grammar and p : P → ]0,1] is a probability function of rule application such that ∀A ∈ N: Σ_{α ∈ (N ∪ Σ)+} p(A → α) = 1.</Paragraph>
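Before turning to the algorithms, the sketch below fixes one possible representation of a category-based SCFG in Chomsky Normal Form and checks the condition just stated, namely that the rule probabilities of every nonterminal sum to one. The toy symbols and probabilities are invented for illustration and are not taken from the paper; the same rule representation is reused in the Inside sketch further below.

```python
from collections import defaultdict

# Toy CNF rules; the terminals are POS categories.
binary_rules = {("S", "NP", "VP"): 1.0,    # S  -> NP VP
                ("NP", "DT", "NN"): 0.6,   # NP -> DT NN
                ("VP", "VB", "NP"): 1.0}   # VP -> VB NP
unary_rules = {("NP", "NN"): 0.4}          # NP -> NN

def is_proper(binary_rules, unary_rules, tol=1e-9):
    """Check that p is a probability function of rule application:
    for every nonterminal A, the probabilities p(A -> alpha) sum to 1."""
    totals = defaultdict(float)
    for (lhs, _, _), prob in binary_rules.items():
        totals[lhs] += prob
    for (lhs, _), prob in unary_rules.items():
        totals[lhs] += prob
    return all(abs(total - 1.0) < tol for total in totals.values())

print(is_proper(binary_rules, unary_rules))  # True
```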
<Paragraph position="3"> Now, we present two algorithms in order to compute the word transition probability. The first algorithm is based on the LRI algorithm, and the second is based on an application of a Viterbi scheme to the LRI algorithm (the VLRI algorithm).</Paragraph> <Paragraph position="4"> Probability of generating an initial substring. The computation of (4) is based on an algorithm which is a modification of the LRI algorithm (Jelinek and Lafferty, 1991). This new algorithm is based on the definition of Pr(A << i, j) = Pr(A ⇒ wi ... wj ... | Gc, Cw) as the probability that A generates the initial substring wi ... wj ... given Gc and Cw. This can be computed with the following dynamic programming scheme:</Paragraph> <Paragraph position="5"/> <Paragraph position="6"> In this expression, Q(A ⇒ D) is the probability that D is the leftmost nonterminal in all sentential forms which are derived from A. The value Q(A ⇒ BC) is the probability that BC is the initial substring of all sentential forms derived from A. Pr(B < i, l >) is the probability that the substring wi ... wl is generated from B given Gc and Cw; its computation will be defined later.</Paragraph> <Paragraph position="7"> It should be noted that the combination of the models Gc and Cw is carried out in the value Pr(A << i, i). This is the main difference with respect to the LRI algorithm.</Paragraph> <Paragraph position="8"> Probability of the best derivation generating an initial substring. An algorithm which is similar to the previous one can be defined based on the Viterbi scheme. In this way, it is possible to obtain the best parsing of an initial substring. This new algorithm is also related to the VLRI algorithm (Sánchez and Benedí, 1997) and is based on the definition of Pr^(A << i, j) = Pr^(A ⇒ wi ... wj ... | Gc, Cw) as the probability of the most probable parsing which generates wi ... wj ... from A given Gc and Cw. This can be computed as follows:</Paragraph> <Paragraph position="9"/> <Paragraph position="10"> In this expression, Q^(A ⇒ D) is the probability that D is the leftmost nonterminal in the most probable sentential form which is derived from A. The value Q^(A ⇒ BC) is the probability that BC is the initial substring of the most probable sentential form derived from A. Pr^(B < i, l >) is the probability of the most probable parse which generates wi ... wl from B.</Paragraph> <Paragraph position="11"> Probability of generating a string. The value Pr(A < i, j >) = Pr(A ⇒ wi ... wj | Gc, Cw) is defined as the probability that the substring wi ... wj is generated from A given Gc and Cw. To calculate this probability, a modification of the well-known Inside algorithm (Lari and Young, 1990) is proposed. This computation is carried out by using the following dynamic programming scheme:</Paragraph> <Paragraph position="12"/> <Paragraph position="13"> In this way, Pr(w1 ... wn | Gc, Cw) = Pr(S < 1, n >).</Paragraph> <Paragraph position="14"> As we have commented above, the combination of the two parts of the grammatical model is carried out in the value Pr(A < i, i >).</Paragraph>
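The string probability just described can be sketched as follows. This is not the authors' implementation, and the dynamic programming schemes of the original paper are not reproduced in this extraction; the sketch simply follows the standard CNF Inside recursion, with the single change that, for spans of length one, the category-based SCFG Gc and the word-distribution model Cw are combined in Pr(A < i, i >). The rule representation is the one from the previous sketch, and word_given_cat stands for the estimator sketched in Section 3.

```python
from collections import defaultdict

def inside_string_probability(words, binary_rules, unary_rules,
                              word_given_cat, start="S"):
    """Pr(w1 ... wn | Gc, Cw) = Pr(S < 1, n >) via an Inside-style recursion.

    binary_rules: {(A, B, C): p}  rules A -> B C of the category-based SCFG
    unary_rules:  {(A, c): p}     rules A -> c, where c is a POS category
    word_given_cat(word, cat):    the word-distribution model Pr(w | c)
    """
    n = len(words)
    inner = defaultdict(float)  # inner[(A, i, j)] = Pr(A < i, j >), 0-based

    # Length-one spans: this is where Gc and Cw are combined.
    for i, word in enumerate(words):
        for (A, cat), p_rule in unary_rules.items():
            inner[(A, i, i)] += p_rule * word_given_cat(word, cat)

    # Longer spans: sum over binary rules and split points.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (A, B, C), p_rule in binary_rules.items():
                total = sum(inner[(B, i, l)] * inner[(C, l + 1, j)]
                            for l in range(i, j))
                inner[(A, i, j)] += p_rule * total

    return inner[(start, 0, n - 1)]
```

Replacing the sums over rules and split points with maximizations gives a Viterbi-style variant, analogous to the best-derivation probabilities described in the text.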
<Paragraph position="15"> Probability of the best derivation generating a string. The probability of the best derivation that generates a string, Pr^(w1 ... wn | Gc, Cw), can be evaluated using a Viterbi-like scheme (Ney, 1992). As in the previous case, the computation of this probability is based on the definition of Pr^(A < i, j >) = Pr^(A ⇒ wi ... wj | Gc, Cw) as the probability of the best derivation that generates the substring wi ... wj from A given Gc and Cw. Similarly:</Paragraph> <Paragraph position="16"/> <Paragraph position="17"> Finally, the time complexity of these algorithms is the same as that of the algorithms they are related to; therefore, the time complexity is O(k³|P|), where k is the length of the input string and |P| is the size of the SCFG.</Paragraph> </Section> <Section position="8" start_page="57" end_page="58" type="metho"> <SectionTitle> 5 Experiments with the Penn Treebank Corpus </SectionTitle> <Paragraph position="0"> The corpus used in the experiments was the part of the Wall Street Journal which had been processed in the Penn Treebank project (Marcus et al., 1993). This corpus consists of English texts collected from the Wall Street Journal from editions of the late eighties. It contains approximately one million words. This corpus was automatically labelled, analyzed and manually checked as described in (Marcus et al., 1993).</Paragraph> <Paragraph position="1"> There are two kinds of labelling: a POStag labelling and a syntactic labelling. The size of the vocabulary is greater than 25,000 different words, the POStag vocabulary is composed of 45 labels, and the syntactic vocabulary is composed of 14 labels.</Paragraph> <Paragraph position="2"> The corpus was divided into sentences according to the bracketing. In this way, we obtained a corpus whose main characteristics are shown in Table 1.</Paragraph> <Paragraph position="3"> Table 1: Characteristics of the corpus once it was divided into sentences. No. of senten.: 49,207; Av. length: 23.61; Std. deviation: 11.13; Min. length: 1; Max. length: 249.</Paragraph> <Paragraph position="4"> We took advantage of the category-based SCFGs estimated in a previous work (Sánchez and Benedí, 1998). These SCFGs were estimated with sentences which had less than 15 words. Therefore, in this work, we assumed such a restriction. The vocabulary size of the new corpus was 6,333 different words. For the experiments, the corpus was divided into a training corpus (directories 00 to 19) and a test corpus (directories 20 to 24). The characteristics of these sets can be seen in Table 2. The part of the corpus labeled with POStags was used to estimate the parameters of the grammatical model, while the non-labeled part was used to estimate the parameters of the n-gram model.</Paragraph> <Paragraph position="5"> Table 2: Data used for the experiments when the sentences with more than 15 POStags were removed. Test set: 2,295 sentences; standard deviation of the sentence length: 3.55.</Paragraph> <Paragraph position="6"> The parameters of a 3-gram model were estimated with the software tool described in (Rosenfeld, 1995). We used the linear interpolation smoothing technique supported by this tool. The out-of-vocabulary words were grouped in the same class and were used in the computation of the perplexity. The test set perplexity with this model was 180.4.</Paragraph>
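For reference, the test-set perplexity figures reported in this section follow the usual definition, sketched below; how the out-of-vocabulary class and sentence boundaries are counted depends on the tool cited above and is not reproduced here.

```python
import math

def perplexity(word_probabilities):
    """Test-set perplexity PP = exp(-(1/K) * sum_k log Pr(wk | history_k)),
    where the per-word probabilities come from the model being evaluated
    (e.g. the 3-gram model, or the combined model of expression (1))."""
    log_probs = [math.log(p) for p in word_probabilities]
    return math.exp(-sum(log_probs) / len(log_probs))
```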
<Paragraph position="7"> The values of expression (3) were computed from the tagged and non-tagged parts of the training corpus. In order to avoid null values, the unseen events were labeled with a special symbol 'w' which did not appear in the training set, such that Pr('w'|c) ≠ 0, ∀c ∈ C, where C was the set of categories. That is, all the categories could generate the unseen event. This probability took a very small value (several orders of magnitude less than min_{w ∈ V, c ∈ C} Pr(w|c), where V was the vocabulary of the training corpus), and different values of this probability did not change the results. The parameters of an initial ergodic SCFG were estimated with each one of the estimation methods mentioned in Section 3. This SCFG had 3,374 rules, composed of 45 terminal symbols (the number of POStags) and 14 non-terminal symbols (the number of syntactic labels). The probabilities were randomly generated and three different seeds were tested, but only one of them is reported given that the results were very similar. The training corpus was the labeled part of the described corpus.</Paragraph> <Paragraph position="8"> [Table: test set perplexity with the SCFGs estimated with the methods mentioned in Section 3 — VS, kVS, IOb, VSb.]</Paragraph> <Paragraph position="9"> Once we had estimated the parameters of the defined model, we applied expression (1), using the LRI algorithm and the VLRI algorithm to compute expression (4). The test set perplexity obtained as a function of α for the different estimation algorithms (VS, kVS, IOb and VSb) can be seen in Fig. 1. In the best case, the proposed language model obtained more than a 30% improvement over the results obtained by the 3-gram language model (see Table 4). This result was obtained when the SCFG estimated with the IOb algorithm was used. An important aspect to note is that the weight of the grammatical part was approximately 50%, which means that this part provided important information to the language model.</Paragraph> </Section> </Paper>