<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1081">
  <Title>A Stochastic Parser Based on a Structural Word Prediction Model Shinsuke MORI, Masafumi NISHIMURA, Nobuyasu ITOH,</Title>
  <Section position="3" start_page="558" end_page="560" type="metho">
    <SectionTitle>
2 Stochastic Language Model based
on Dependency
</SectionTitle>
    <Paragraph position="0"> In this section, we propose a stochastic /angua.ge model based on dependency. Unlike most stochastic language models %r a. parser, our model is theoreticMly based on a hidden Markov model. In our model a. sentence is predicted word by word fi'om left to right and the state at ea.ch step of prediction is basieMly a. sequence of words whose modifiea.nd has not appeared yet. According to a psyeholinguistic report on la.nguage structure (Yngve, 1960), there is an upper limit on the number of the words whose inodificaJ~ds ha.ve not appeared yet. This limit is determined by tim mmfloer of slots in sl~ort-term memory, 7 :k 2 (Miller, 1956). With this limitation, we Call design a pa.rser based on a linite state model.</Paragraph>
    <Section position="1" start_page="558" end_page="559" type="sub_section">
      <SectionTitle>
2.1 Sentence Model
</SectionTitle>
      <Paragraph position="0"> '\]'he I)asic idett of our model is that each word would be better predicted from the words that have a. dependency rela.tion with the. word to be predicted than from the preceding two words (l.ri-gram model).</Paragraph>
      <Paragraph position="1"> Let us consider the complete structur('~ of the sentence in /&amp;quot;igure I and a \]tyl)otheti(:al struetm:e after the 1)rediction of tile lifth word at the top of Figure 2. In this hypothetica.l st;ructure, there are three trees: one root-only tree (/q, eomposc'd of wa) a.nd two two-node trees (l. conta.ining 'wz and 'w2, and l(, containing w4 an(1 'w5). If the last two trees (&amp; and le) de4)end on the word we, this word may better be predicted from thes(~ two trees. I&amp;quot;rom this I)oint of view, our model Ill-st: predicts the trees del)cnding on the next word and then l)redicts the next word from thes(&amp;quot; trees.</Paragraph>
      <Paragraph position="2"> Now, let us make the tbllowing definitions in order to explain our model formally.</Paragraph>
      <Paragraph position="3"> * 11~ ~-ttqlv2...'tt)~ : a, seqllcnce of words. \]\]ere a. word is define(l as a, pair consisting of a string of alplmbetic chara.cters and a, pa.rt of speech (e.g.</Paragraph>
      <Paragraph position="4"> the/DT).</Paragraph>
      <Paragraph position="5"> * ti = lil2&amp;quot;&amp;quot;lk, : a, sequence of parrtiM parse trees covering the i-pretix words ('w~ w~... wi).</Paragraph>
      <Paragraph position="6"> * t + trod t~- : subsequences of ti ha.ving a.nd not having a. dependeney relation with the next word respectively. In .h~p~mese, like many other langua.ges, no two dependency relations cross each other; thus tl = t~ t +, * (t w) : a tree with 'w as its root a.nd t as the sequence of all subtrees connected to the root.</Paragraph>
      <Paragraph position="7"> After wi+l has been predicted from the trees depending on it (t+), there a.re trees renmin-</Paragraph>
      <Paragraph position="9"> * Jhna:r : upper limit on the munber number of words whose moditicands have not appeared yet.</Paragraph>
      <Paragraph position="10"> Under these definitions, our stochastic language model is defined as follows:</Paragraph>
      <Paragraph position="12"> where 7;, is all possible bhm.ry trees with n nodes.</Paragraph>
      <Paragraph position="13"> lie,','., the first fi~.ctor, (P(wilt+ 1)), is ca.lled the word prediction model and the second, (P (~'~1 } ti-1 ))' the state prediction model. Let us consider Figure 2 aga.in. At. the top is the state just a.fter the prediction of the tilth word. The state prediction model then predicts the pa.rtial purse trees depending on the next word a.mong all partial parse trees, as shown in the second figure. Finally, the word prediction model predicts the next word Dora the partial parse trees depending on it.</Paragraph>
      <Paragraph position="14"> As described above, there may be an upper limit on the number of words whose modificands ha.ve not yet appeared. To put it in a.nother way, the length of the sequence of l)artial parse trees (ti) is limited.  There%re, if the depth of the partial parse tree is also limited, the number of possible states is limited. Under this constraint, our model can be considered as a hidden Markov model. In a hidden Marker model, the first factor is called the output probability and the second, the transition probability.</Paragraph>
      <Paragraph position="15"> Since we assmne that no two dependency relations cross each other, the state prediction model only has to predict the mmaber of the trees depending on the ,text word. Tln, s S'(t+_,lt,._,) = ~':'(ylt~_~) where y is the number of trees in the sequence t?_ 1, According to the above assumption, the last y partial parse trees depend on the i-th word. Since the nmnber of possible parse trees for a word sequence grows exponentially with the number of the words, the space of the sequence of partial parse trees is huge even if the length of the sequence is limited.</Paragraph>
      <Paragraph position="16"> 'PShis inevitably causes a data-sparseness problem.</Paragraph>
      <Paragraph position="17"> To avoid this problern, we limited the number of levels of nodes used to distinguish trees. In our experiment, only the root and its children are checked to identify a partial parse tree. Hereafter, we represent \]JLL to denote this model, in which the lexicon of the first level and thai; of the second level are considered. Thus, in our experiment each word and the number of partial parse trees depending on it are predicted by a sequence of partial parse trees that take account of the nodes whose depth is two or less.</Paragraph>
      <Paragraph position="18"> It is worth noting that if the dependency structure of a sentence is linear -- that is to say, if each word depends on the next word, -- then our model will be equivalent to a word tri-gram model.</Paragraph>
      <Paragraph position="19"> We introduce an interpolation technique (Jelinek et al., 1991) into our model like those used in n-gram models. By loosening tree identification regulations, we obtain a more general model. For example, if we check only the POS of the root and the I?OS of its children, we will obtain a model similar to a POS tri-gram model (denoted PPs' hereafter). If we check the lexicon of the root, but not that of its children, the model will be like a word bi-gram model (denoted PNL hereafter). As a smoothing method, we can interpolate the model PLL, similar to a word tri-gram model, with a more general model, PPP or PNL. In our experiment, as the following formula indicates, we interpolated seven models of different generalization levels:</Paragraph>
      <Paragraph position="21"> where X in PYx is the check level of the first level of the tree (N: none, P: POS, L: lexicon) and Y is that of the second level, and lG,c-gr&lt;~m is the uniform distribution over the vocabulary W (-\])~U,O--gI'D,I~\](*/)) = l/IWl).</Paragraph>
      <Paragraph position="22"> The state predictio,, model also interpola.ted in the salne way. in this case, the possible events are y = 1,2,..., Ym(~x, thus; /~a,0-gr&lt;~m =</Paragraph>
      <Paragraph position="24"/>
    </Section>
    <Section position="2" start_page="559" end_page="559" type="sub_section">
      <SectionTitle>
2.2 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> Since our model is a hidden Markov model, the parameters of a model can l)e estimated from at. row corpus by EM algorithm (13amn, 1972). With this algorithm, the probability of the row corpus is expected to be maxinfized regardless of the structure of ea.ch sentence. So the obtained model is not always appropriate for a. parser.</Paragraph>
      <Paragraph position="1"> In order to develop a model appropriate for a parser, it is better that the parameters are estimated from a syntactically annotated corlms by a maximmn likelihood estimation (MI,E) (Meriaklo, 1994:) as follows:</Paragraph>
      <Paragraph position="3"> where f(x) represents the frequency of an event x in tile training corpus.</Paragraph>
      <Paragraph position="4"> The interpolation coeificients in the formula (2) are estimated by the deleted interpolation method (aelinek et al., 1991).</Paragraph>
    </Section>
    <Section position="3" start_page="559" end_page="560" type="sub_section">
      <SectionTitle>
2.3 Selecting Words to be Lexicalized
</SectionTitle>
      <Paragraph position="0"> Generally speaking, a word-based n-gram model is better than a l&gt;OS-based 'n-gram model in terms of  predictive power; however lexica.lization of some infrequent words may be ha.rmfu\] beta.use it may c;mse a. data-sparseness problem. In a. practiea.1 tagger (I(upiec, \] 989), only the nlost, frequent \] 00 words a.re lexicalized. Also, in a, sta.te-ofthe-a.rt English pa.rser (Collins, 1997) only the words tha, t occur more tha,n d times in training data. are lexicalized.</Paragraph>
      <Paragraph position="1"> For this reason, our pa.rser selectn the words to be lexicalized at the time of lea.rning. In the lexicalized models described above (P/A;, I},L and f~VL), only the selected words a.re \]exica.lized. The selection criterion is parsing a.ccuracy (see section 4) of a. hekl-out corpus, a small part of the learning col pus excluded from l)a, ramcter cstima.tion. Thus only the words tliat a.re 1)redicte(1 to improve the parsing a.Ccllra.oy of the test corpilS&gt; or illlklloWll illpll{,&gt; i/3&amp;quot;e lexicalized. The algorithm is as follows (see l,'igurc  a): \]. In the initial sta.te a.ll words are in the class of their I)OS.</Paragraph>
      <Paragraph position="2"> 2. All words are sorted ill descending order of their  frequency, a.nd the following 1)rocens is executed for each word in this order: (a.) The word is lexicalizcd provisionally and the accura.cy el tile held-oul, corpus is (:;/lcilia.ted. null  (b) Ir a.n illiproven\]ont in observed, the word is 10xica.lized definitively.</Paragraph>
      <Paragraph position="3"> Tile result of this \]exica.liza.tion algoril.lun is used to identil~y a. \])a.rtia.l l)arse tree. That is to say, ()ill 3, Icxiealized wordn are distinguished in lexicalized models. It&amp;quot; IlO wordn Were nelcctxxl I:o be lexica/ized, {;hell</Paragraph>
      <Paragraph position="5"> nol;ing that if we try to .ioi,, a word with allo/,/ler wet(l, then this a,lgOlJithnl will be a, llorlna\] top-down c, lustcring a.igorithnl.</Paragraph>
    </Section>
    <Section position="4" start_page="560" end_page="560" type="sub_section">
      <SectionTitle>
2.4 Unknown Word Model
</SectionTitle>
      <Paragraph position="0"> To enable olir stocllastic la.nguage lnodel to handle unknowil words&gt; we added a.li ui/knowii word model based Oil a cha.ra,cter \])i-giPa,nl nio(M, lr the next word is not in the vocabula.ry, the n\]o(lel predicts its POS a.nd the llllklloWll word model predicts the string of the word as folkm, s:</Paragraph>
      <Paragraph position="2"> 1}'1&amp;quot;, a special character corresponding to a word l)oundary, in introduced so tha.t the ntlilt of the l)robability over all sl, rillgs is eqlla,\] to 71.</Paragraph>
      <Paragraph position="3"> In the l)ara.lneter cstima.tion described a.1)ove, a.</Paragraph>
      <Paragraph position="4"> learning corpus is divided into k parts. In our exi)erirnent, the vocabulary is composed of the wordn</Paragraph>
      <Paragraph position="6"> other words are used for 1)arameter entima, tion of the unknown word model. The l)i-gram l)robal)ility of the unknown word model of a. I)OS in estimated \['ronl the words among them and belonging to the POS as follows: 1 !l, o s (.~, i la: i - J ) M)~V .fP o s (* i, * i- ~ ) fPos (,,;~-,) The character I)i%ram model is also interl)ola.ted Wi/,\]I a llili-~r~iill illode\] and a Zel'O-~l'aill lllodc\]. The interl)olation coellicients are estinmi.d by the deleted interpolation method (,lelinek el. al., 1991).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="560" end_page="560" type="metho">
    <SectionTitle>
3 Syntactic Analysis
</SectionTitle>
    <Paragraph position="0"> (,el cJ dl.y, a. l)a.rscr may I)c considered an a module that recdvcs a. sequence of words annotated with a, I'()S and oul.putn its structm'e. Our parner, which includes a stochastic mflmown word model, however, is a.I)le to a.cc.el)t a cha.ra.ctc'r sequence as an input and execute segmenta.tion, POS tagging, and syntactic analysis nimultaneously I . In this section, wc exphfin our pa.rser, which is based on the language modal described in the preceding section.</Paragraph>
    <Section position="1" start_page="560" end_page="560" type="sub_section">
      <SectionTitle>
3.1 Stochastic Syntactic Analyzer
</SectionTitle>
      <Paragraph position="0"> A syntactic analyzer, bancd on a. stochastic language model, ca.lculatc's the pa.rse tree (see Figure 1) with the highest probability for a given scquencc of characters x according to the following tbrmula.:</Paragraph>
      <Paragraph position="2"> where w(T) represents the concatenation of the word string in the syntactic trek T. P(T) in the last line is a stochastic language model, in our parser, it is the probability of a parse tree T defined by the stochastic dependency model including the unknown word model described in section 2.</Paragraph>
      <Paragraph position="4"> where wlw2&amp;quot;. &amp;quot;wn = w(T).</Paragraph>
    </Section>
    <Section position="2" start_page="560" end_page="560" type="sub_section">
      <SectionTitle>
3.2 Solution Search Algorithm
</SectionTitle>
      <Paragraph position="0"> As shown in formula (3), our parser is based on a hidden Markov model. It follows that Viterbi algorithm is applicable to search the best solution. Viterbi algorithm is capable of calculating the best solution in O(n) time, where n is the number of input characters. null The parser repeats a state tra.nsition, reading characters of the input sentence from left to right. In order that the structure of the input sentence may be a tree, the number of trees of the final state tn must be 1 and no more. Among the states that satisfy this condition, the parser selects the state with the highest probability. Since our language model uses only the root and its children of a partial parse tree to distinguish states, the last state does not have enough information to construct the parse tree. The parser can, however, calculate the parse tree fi'om the sequence of states, or both the word sequence and the sequence of y, the number of trees that depend on the next word. Thus it memorizes these values at each step of prediction. After the most probable last state has been selected, the parser constructs the parse tree by reading these sequences fi:om top to bottom.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML