File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-1058_metho.xml

Size: 25,862 bytes

Last Modified: 2025-10-06 14:14:10

<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1058">
  <Title>Three New Probabilistic Models for Dependency Parsing: An Exploration*</Title>
  <Section position="3" start_page="0" end_page="343" type="metho">
    <SectionTitle>
2 Probabilistic Dependencies
</SectionTitle>
    <Paragraph position="0"> It cannot be emphasized too strongly that a grammarital rcprcsentalion (de4)endency parses, tag sequen(-es, phrase-structure trees) does not entail any particular probability model. In principle, one couht model the distribution of dependency l)arses l()ur novel parsing algorithm a/so rescues depen dency from certain criticisins: &amp;quot;l)ependency granlmars ...are not lexicM, and (as far ~ as we know) lacl( a parsing algorithm of efficiency compara.ble to link grammars.&amp;quot; (LMferty et ;LI., 1992, p. 3)  in any uuml)er of sensible or perverse ways. 'l'h(~ choice of l;he right model is not a priori (A)vious.</Paragraph>
    <Paragraph position="1"> One way to huild a l)robabilistie grammar is to specify what sequences of moves (such as shift an(/ reduce) a parser is likely to make. It is reasonable to expect a given move to be correct about as often on test data. as on training data. This is tire philosophy behind stochastic CF(I (aelinek et a1.1992), &amp;quot;history-based&amp;quot; phrase-structure parsing (I-~lack et al., 1992), +m(I others.</Paragraph>
    <Paragraph position="2"> IIowever, i)rol)ability models derived from parsers sotnetimes focus on i,lci(lental prope.rties of the data. This utW be the case for (l,alli'.rty et M., 1992)'s model for link grammar, l\[' we were to adapt their top-(h)wn stochastic parsing str~tegy to the rather similar case of depen(lency grammar, we would find their elementary probabilities tabulating only non-intuitive aspects of the parse structure: Pr(word j is the rightmost pre-k chihl of word i \] i is a right-sl)ine st, rid, descendant of one of the left children of a token of word k, or else i is the parent of k, and i l)re(;edes j t)recerles k). :e While it is dearly necessary to decide whether j is a child of i, conditioning that (Iccision as alrove may not reduce its test entropy as mneh as a tnore linguistically perspienous condition woul(/.</Paragraph>
    <Paragraph position="3"> We believe it is ffttil,\['u\[ to de.sign prol&gt;al)ility models indel)en(letrtly of tit(' pa.rser. In this seelion, we will outline the three+ lexicalist, linguistically perspicuous, qualitatiw~ly different models that we have (leveloped a, nd tested.</Paragraph>
    <Section position="1" start_page="340" end_page="340" type="sub_section">
      <SectionTitle>
2.1 Model A: Bigram lexical affinities
</SectionTitle>
      <Paragraph position="0"> N-gram tatters like (Church, 1988; .lelinek 1985; Kupiec 1992; Merialdo 1990) take the following view of \]row ~/, tagged sentctrce enters the worhl.</Paragraph>
      <Paragraph position="1"> I&amp;quot;irst, a se.(tuenee of tags is g('nexate.d aecordittg to a Markov l)rocess, with t.h(' random choice of e~ch tag conditioned ou the previous two tags. Second, a word is choseu conditional on each tag.</Paragraph>
      <Paragraph position="2"> Since our sentences have links as well as tags and words, suppose that afl;er the words are inserte(l, each senl;ence passes through a third step that looks at each pair of words and ran(lotnly decides whether to link them. For the resulting sentences to resemble real tort)era, the. probability that word j gets linked to word i should b(' le:~:i(:ally scnsilivc: it should depend on the (tag,word) pairs at both i and j.</Paragraph>
      <Paragraph position="3"> 'Fhe probability of drawing a given parsed sen(once froln the+ population may then be expressed 2This correspouds to l,Mi'erty el, al.'s central st~ttistk: (p. 4), l'r(m +-I L, le, l,r), in the case where i's pa.rent is to the left el i. i,j, k correspond to L, W, R respectively. Owing to the particular re(:ursiw~ strategy the p~trscr uses to bre+tk up the s(!\[tl,(?n(:e, the statistic would be measured ~ttld utilized only under the condition (lescribed above.</Paragraph>
      <Paragraph position="4">  (a) Ihe \[nice of Ihc sRu:k 1%11 I)T NN IN I)1' NN VIII) (b) tile price uf. the stock R'II \]YI&amp;quot; NN IN I)T NN Viii) t,'igure 3: (++)Th(, ,:orrect parse. (b) A cotnmon cr,or  if the model ignores arity.</Paragraph>
      <Paragraph position="5"> as (1) in \[,'igure 2, where the random wMable Lij G {0, 1} is t iff word i is the parent of word j. Expression (1) assigns a probability to e.very possible tag-a.nd-link-annotated string, and these l)robabilities sunl to one. Many or the annotated strings exhibit violations such as crossing links and multiple parents which, iftheywcreallowed, wouhl let all the words express their lexical prefe.rences independently and situttlta.ne:ously. We SiAl)ulate that the model discards fl'om the popula+tion tiny illegal structures that it generates; they do not appear in either training or test data. Therefore, the parser described below \[inds the likeliest legal structure: it maximizes the lexical preferences of (l) within the few hard liuguistic coush'ainls itnlrosed by the del)endency formalism.</Paragraph>
      <Paragraph position="6"> In practice, solrre generalization or &amp;quot;coarsenlug&amp;quot; of the conditionM probabilities in (1) heaps to avoid tile e.ll~ets of undertrMning. For exalHph'., we folk)w standard prn(-tice (Church, 1988) in n-gram tagging hy using (3) to al)proxitllate the lit'st term in (2). I)ecisions al)out how much coarsenittg t,o lie are+ o1' great pra(-t, ieal interest, b ut t hey (lel)etM on the training corpus an(l tnay l)e olnitted from a eonc&lt;'.t)tuM discussion of' the model. 'Fhe model in (I) can be improved; it does not (:aptrlr(&amp;quot; the fact that words have arities. For ex+Unl)h.' , lh.e price of lh.c sleek fell (l&amp;quot;igure 3a) will tyl&gt;ically 1)e nlisanalyzed under this model. Since stocks often fall, .sleek has a greater affinity f&lt;&gt;r fl:ll than lbr @ llen&lt;:e stock (as w&lt;'.ll as price) will en&lt;l tt\[) t&gt;ointittg to the verl&gt; ./'(ell (lqgure 31&gt;), result, hit in a double subject for JNI and \[eavitlg of childless. 'l'o Cal)i.nre word aril, ies an(l othe+r stil&gt;cal,&lt;,gr)rizalion I'aets, we must recognize that the. chihh:ert of a word like J~ll are not in(le4)ende.nt of each other.</Paragraph>
      <Paragraph position="7"> 'File sohttion is to nlodi/'y (t) slightly, further conditioning l,lj on the number and/or type of children of i that already sit between i and j. This means that in I, he parse of Figure 3b, the link price -+ \]?~11 will be sensitive to the fact that fell already has a ok)set chihl tagged as a noun (NN). Specifically, tire price --+ fell link will now be strongly disfavored in Figure '3b, since verbs rarely Lalw~ two N N del)endents to the left. By COllt;rast, price --&gt; fell is unobjectionable in l!'igure 3a, rendering that parse more probable. (This change (;an be rellected in the conceptual model, by stating that tire l,ij decisions are Hla(le ill increasing order of link length li--Jl and are no longer indepen(lent.)</Paragraph>
    </Section>
    <Section position="2" start_page="340" end_page="341" type="sub_section">
      <SectionTitle>
2.2 Model B: Selectional preferences
</SectionTitle>
      <Paragraph position="0"> In a legal dependency l)axse, every word except for the head of the setrtence (tile EOS mark) has  Pr'(words, tags, links) =/','(words, tags). Pr(link presences and absences I words, tags) (1) I-\[ I t om(i + 1), twom(i + 2)). I\] I two,.d(i), two,'dO)) ('e) l&lt;i&lt;n l &lt;_i,j &lt;n l'v(tword(i) \] tword(i + 1), tword(i + 2)) ~ l','(tag(i) I tag(i + 1), tag(i + 2)). P,'(word(i) I tag(/)) (a) Pr(words, tags, links) c~ Pr(words, tags, preferences) =/'r(words, tags). Pr(preferences \] words, t~gs) (4) \]-I l',.(twom(i) I two d(i + 1), t o,'d(i + 2)). H I two,.d(i)) 1 &lt;i&lt;n t&lt;i&lt;n / 1 +#right-kids(i) '~ Pv(words, t+gs, links)= II { 1-\[ P,.(two,.d(kid+(i))I t,gj +dd+_,(i) ),t+o,'d(i)) l&lt;i&lt;n \c=-(\]-k#left+kids(i)),eT~0 kid~q_ 1 if c &lt; 0  j are tokens, then tword(i) represents the pair (tag(i), word(i)), and L,j C {0, 1} i~ ~ ill&amp;quot; i is the p~m:nt of j. exactly one parent. Rather than having the model select a subset of the ~2 possible links, as in model A, and then discard the result unless each word has exactly one parent, we might restrict the model to picking out one parent per word to begin with. Model B generates a sequence of tagged words, then specifies a parent or more precisely, a type of parent for each word j.</Paragraph>
      <Paragraph position="1"> Of course model A also ends up selecting a parent tbr each word, but its calculation plays careful politics with the set of other words that happen to appear: in the senterl(;C: word j considers both the benefit of selecting i as a parent, and the costs of spurning all the other possible parents/'.Model B takes an appro;~ch at the opposite extreme, and simply has each word blindly describe its ideal parent. For example, price in Figure 3 might insist (with some probability) that it &amp;quot;depend on a verb to my right.&amp;quot; To capture arity, words probabilistically specify their ideal children as well: fell is highly likely to want only one noun to its left.</Paragraph>
      <Paragraph position="2"> The form and coarseness of such specifications is a parameter of the model.</Paragraph>
      <Paragraph position="3"> When a word stochastically chooses one set of requirements on its parents and children, it is choosing what a link grammarian would call a disjuuct (set of selectional preferences) for the word.</Paragraph>
      <Paragraph position="4"> We may thus imagine generating a Markov sequence of tagged words as before, and then independently &amp;quot;sense tagging&amp;quot; each word with a disjunct, a Choosing all the disjuncts does not quite specify a parse, llowever, if the disjuncts are sufficiently specific, it specifies at most one parse. Some sentences generated in this way are illegal because their disjuncts cannot be simultaneously satisfied; as in model A, these sentences are said to be removed fi'om the population, and the probabilities renormalized. A likely parse is therefore one that allows a likely and consistent aln our implementation, the distribution over possible disjuncts is given by a pair of Markov processes, as in model C.</Paragraph>
      <Paragraph position="5"> set of sells(', tags; its probability in the population is given in (4).</Paragraph>
    </Section>
    <Section position="3" start_page="341" end_page="343" type="sub_section">
      <SectionTitle>
2.3 Model C: Recursive generation
</SectionTitle>
      <Paragraph position="0"> The final model we prol)ose is a generation model, as opposed l;o the comprehension mo(lels A and B (and to other comprehension modc, ls such as (l,afferty et al., 1992; Magerman, 1995; Collins, 1996)). r\]'he contrast recalls an ohl debate over spoken language, as to whether its properties are driven by hearers' acoustic needs (coml)rehension) or speakers' articulatory needs (generation). Models A and B suggest that spe~kers produce text in such a way that the grammatical relations can be easily decoded by a listener, given words' preferences to associate with each other and tags' preferences to follow each other. But model C says that speakers' primary goal is to flesh out the syn tactic and conceptual structure \['or each word they utter, surrounding it with arguments, modifiers, and flmction words as appropriate. According to model C, speakers should not hesitate to add extra prepositionM phrases to a noun, even if this lengthens some links that are ordinarily short, or leads to tagging or attachment mzJ)iguities.</Paragraph>
      <Paragraph position="1"> The generation process is straightforward. Each time a word i is added, it generates a Markov sequence of (tag,word) pairs to serve, as its left children, and an separate sequence of (tag,word) pairs as its right children. Each Markov process, whose probabilities depend on the word i and its tag, begins in a speciM STAI{T state; the symbols it generates are added as i's children, from closest to farthest, until it re~ches the STOP state, q'he process recurses for each child so generated. This is a sort of lexicalized context-free model.</Paragraph>
      <Paragraph position="2"> Suppose that the Markov process, when gem crating a child, remembers just the tag of the child's most recently generated sister, if any. Then the probability of drawing a given parse fi'om the population is (5), where kid(i, c) denotes the cthclosest right child of word i, and where kid(i, O) = START and kid(i, 1 + #,'ight-kids(i)) = STOP.</Paragraph>
      <Paragraph position="3">  has one pa,rcnt, lcss cndwor(I; its sul)sl)+tn (b) lists two. (c &lt; 0 in(h'xes l('ft children,) 'Fhis may bc thought o\[&amp;quot; as a, non-linca.r l;rigrrmt model, where each t;agg('d woM is genera, l,ed 1)ascd on the l)a.r ('nl, 1,~gg(:d wor(l and ;t sistx'r tag. 'l'he links in the parse serve Lo pick o,tt; t, he r('Jev;mt t,rit:;t+a,n~s, and a.rc' chosen 1;o g('t; l,rigrams t, lml, ot)l, imiz(~ t, hc glohM t,a,gging. 'l'tt;tl; the liuks also ha.t)l)en t;o ;ulnot,;:d,('. useful setnant;ic rela, tions is, from this t&gt;crsl)ective, quil.e a(-cidcn{,a,l.</Paragraph>
      <Paragraph position="4"> Note that the revised v(',rsiol~ of ulo(h:t A uses prol)a, bilit, ics /&amp;quot;@ink to chihl I child, I)arenl,, closer-('hihh:en), where n.)(le\] (; uses l'v(link 1,o child \] parent,, eloscr-chil(h'en). 'l'his is I)c(:;,.t~se model A assunw.s 1,lu~l, I,h('. (:hild was i)reviously gencrat, ed I)y a lin(;a,r l)roc('ss, aml all t;hal, is necess+u'y is t,o li.k 1,o it,. Model (~ a, cl,ually g(,n(;ral,es t, he chihl in the process o\[' liuking to il,.</Paragraph>
      <Paragraph position="5">  3 Bottom-\[)i) Dependency Parsing lu this sec.tAon we sket(:h our dependel.'y l)m'sing ;dg;oril, hnl: ~ novel dytmJni('.-l)rogr;mJndng m('.l,hod 1,o assetnhle l, he mosl, l&gt;rol)a,ble+ i)a.rse From the bet,tom Ul). The algori@m ++(l(Is one link at a l, ime, nmking il; easy to multiply oul, the hie(lois' l)rolm hility l'a(:t, ors. It, also enforces I,hc special direc Lion;dil,y requiremenl~s of dependency gra.nnnar, 1;he l)rohibitions on cycles mM nlultiple par('nl,s. 4 '\['\]10 liic.t\]tod llsed is similar t;o tie C K Y met.hod of cont.exl,-fr('e l)~rsing, which combines aJIMys(:s  of shorl, er substrings into analys&lt;:s of progressively longer ones. Multiple a.na.lyses It;we l, hc s~tnm signature if t;hey are indistinguishal&gt;le i, their M)ility to (;Otlll)ill(? wit,h other analyses; if so, the parser disca,rds all but, the higlmsl,-scoring one.</Paragraph>
      <Paragraph position="6"> CI,:Y t'cquit',;s ()(?,.:t~ ~) t.i,,,,' +utd O(,,.:'.~) sp+.'.,;, where n is the lenglih of 1,he s(mtcn(:c and ,s is a,n Upl)(;r bouiM on signal;ures 1)er subsl;ring.</Paragraph>
      <Paragraph position="7"> Let us consider dependency parsing in t;his framework. ()he mighl; guess that each substa'ing ;mMysis shottld bc t+ lcxicM tree ;+ tagged he;ulword plus aJl Icxical sulfl;rees dependc'nt, upon i/,. (See l&amp;quot;igure 111.) llowew, r, if a. o:/tst,il, cnt s * 11,Mmled depend(reties a,re possible, a.nd a minor va,ria, nt ha.ndles the sitnplcr (:~tse of link gra.tnltl;-u', hideed, abstra.ctly, the a.lgorithm rescmbies ;t c\](,.aamr, bottom-up vcrsiou of the top-down link gr~tmm~tr pa,rser develol)ed independently by (l,Ml:'crty et aJ., 1992).</Paragraph>
      <Paragraph position="8"> ...... ~fz_ ....... ~ ~ ......................... .+ _~.._._ ....... , %i.y- - ....&lt;</Paragraph>
      <Paragraph position="10"> I&amp;quot;iglll'e 5' The ass,:,mbly of a span c from two sm:LIIcr spaus (a a,nd b) ~tml a cove.ring link. Only . is miuimal.</Paragraph>
      <Paragraph position="11"> probabilistic behavior depends on iL~ he.adword (;he lcxicMisL hypoiJmsis titan dilt'erent;ly hc~:uhxl a.na.lyses need dilt'erenI; sigmrtures. There a.re al.</Paragraph>
      <Paragraph position="12"> lca+sl, k of t,hcsc for a, s/ibst;rhl,~ Of le..e;IJI k, whence Ge houn(t ,,~ :: t: = ~(u), giving: ;i l, illm COml)lexit,y of tl(,s). ((~ollins, 19.%)uses t,his t~(,'.&amp;quot;-')a, lgo ril, lml (lireclJy (t,ogel,h('r wil, h l)runiug;).</Paragraph>
      <Paragraph position="13"> \'\% I)rOl)OSe a,u aJl,ermtl, ive a,I)l)roa.('h l, ha, I, I)re serves the OOP) hound, hls~ca(t of analyzing sul) st,ri.gs as lcxical t, rees that, will be linked t, ogoJ,her in(,o la, rgcr h'~xica, I l, rees, t, lic I)arsc, r will ana, lyze I,hc'ln a,s uon-const,itm'.nt, sl)a:n.s t;haJ, will he cou cat;cm~t,ed into larger spans. A Sl)a,n cousisl;s el'</Paragraph>
      <Paragraph position="15"> (:el)l, possibly the last; ;t list, of all del.'mle.cy \]i, ks muong the words in l, hc Sl)an; and l)erha, l)S s()lue other inl'ornml,ic, n carried a, long in t;lu, siren's signaJ, mc. No cych's, n,ull, iph' l)a, rcnts, or (','ossi,tg liul.:s are Mlowed in the Sl)a.u, and each Jut,re'hal word of' l, he Sl&gt;ml must ha, vc ~ Ira.rein iu the q);m+ Two sl&gt;a, ns at&lt;&gt; illustraJ,ed in I&amp;quot;igure d, 'l'hese dia,gra.nts a, rc I,yl)ica,l: a, Sl)a,n el&amp;quot; a (Icpendct.:y l)a+rsc may consist, of oil,her a I)a+rcn(,less endword and some o\[' its des(:cn&lt;hmt,s on one side (l&amp;quot;igtu'c 4a), or two parent, less cndwords, with a.ll t,he right &amp;&amp;quot; s('(mda, nLs oF(me and all l;hc M'I, dcscen(hml,s of I, Ii(~ el, her (lq,e;urc 4b). '1'tl(.' im, uilAon is I, haJ, L\]le. illl,('A' hal part; of a, span is gra, nmmtically iuert: excel)l, Ior tit(', cmlwords dachsh, u~td mid play, l;hc struc lure o1' ea,ch span is irrelewml, I,o t,\]1(; Sl&gt;Cm's al)ility t,o cotnbinc iu ful,ure, so sl)a, ns with different inter1ml strucl, tu'e ca,n colnlmte to bc t;hc I)est,-scoring span wil, h a, lm,rticula,r signal;urc.</Paragraph>
      <Paragraph position="16"> 117 sl)an a, ctMs on t,he saanc word i l;\[ha, l, st,al'l,s span b, t,h(;n law I)a,rs(er tries l;o c(&gt;ml&gt;ine I,hc l, wo spans I)y cove, red-(-(mvatcnation (l&amp;quot;igur(; 5).</Paragraph>
      <Paragraph position="17"> The I,wo Col)ies of word i arc idc.nt, i\[ied, a, fl,er which a M'l,waM or rightwaM cove\]\['ing link is ol)l;ionMly added I)ct,wceu t,h(' c.dwor(ts of t,h0. ,.&gt;.v sf)a,n. Any tlepcudcncy parse ca, n I)c built Ill:) hy eovered-coitca, tena, i;ion. When the l)a,rser covcrcd('O\]lCaJ,enat,cs (~ trod b, it, ol)l, ains up to IJtrce new SlmUS (M't, wa, rd, right,war(I, and no coveritlg \]ink).</Paragraph>
      <Paragraph position="18"> The &lt;'o',,ered-(:oncaJ,cnal,ion of (+ a.nd b, I'ornfing (', is 1)arrcd unh;ss it, tricots terra, in simple test;s: * . must, I)e minimal (not, itself expressihle ++s a concaLenal,ion of narrower spaus). This prcvenLs us from assend&gt;ling c in umltiple ways.</Paragraph>
      <Paragraph position="19"> * Since tim overlapping word will bc int;ertta,l to c, it; Illll81\[, ha, ve ?g parenl; in cxa,(;L\]y oile of a told b.  leftmost word of a or the rightmost word of b has a parent. (Violating this condition leads to either multiple parents or link cycles.) Any sufficiently wide span whose left endword has a parent is a legal parse, rooted at the EOS mark (Figure 1). Note that a span's signature must specify whether its endwords have parents.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="343" end_page="343" type="metho">
    <SectionTitle>
4 Bottom-Up Probabilities
</SectionTitle>
    <Paragraph position="0"> Is this one parser really compatible with all three probability models? Yes, but for each model, we must provide a way to keep tr~tck of probabilities as we parse. Bear in mind that models A, B, and C do not themselves specify probabilities for all spans; intrinsically they give only probabilities for sentences.</Paragraph>
    <Paragraph position="1"> Model C. Define each span's score to be the product of all probabilities of links within the span. (The link to i from its eth child is associated with the probability Pr(...) in (5).) When spans a and b are combined and one more link is added, it is easy to compute the resulting span's score: score(a), score(b)./degr(covering link)) When a span constitutes a parse of the whole input sentence, its score as just computed proves to be the parse probability, conditional on the tree root EOS, under model C. The highest-probability parse can therefore be built by dynamic programming, where we build and retain the highest-scoring span of each signature.</Paragraph>
    <Paragraph position="2"> Model B. Taking the Markov process to generate (tag,word) pairs from right to left, we let (6) define the score of a span from word k to word (?.</Paragraph>
    <Paragraph position="3"> The first product encodes the Markovian probability that the (tag,word) pairs k through g- 1 are as claimed by the span, conditional on the appearance of specific (tag,word) pairs at g, ~+1. ~ Again, scores can be easily updated when spans combine, and the probability of a complete parse P, divided by the total probability of all parses that succeed in satisfying lexical preferences, is just P's score. Model A. Finally, model A is scored the same as model B, except for the second factor in (6), SThe third factor depends on, e.g., kid(i,c- 1), which we recover fl'om the span signature. Also, matters are complicated slightly by the probabilities associated with the generation of STOP.</Paragraph>
    <Paragraph position="4"> 6Different k-g spans have scores conditioned on different hypotheses about tag(g) and tag(g + 1); their signatures are correspondingly different. Under model B, a k-.g span may not combine with an 6-~n span whose tags violate its assumptions about g and g + 1.</Paragraph>
  </Section>
  <Section position="5" start_page="343" end_page="344" type="metho">
    <SectionTitle>
5 Empirical Comparison
</SectionTitle>
    <Paragraph position="0"> We have undertaken a careful study to compare these models' success at generalizing from training data to test data. Full results on a moderate corpus of 25,000+ tagged, dependency-annotated Wall Street Journal sentences, discussed in (Eisner, 1996), were not complete hi; press time. However, Tables 1 2 show pilot results for a small set of data drawn from that corpus. (The full resnlts show substantially better performance, e.g., 93% correct tags and 87% correct parents fbr model C, but appear qualitatively similar.) The pilot experiment was conducted on a subset of 4772 of the sentences comprising 93,a~0 words and punctuation marks. The corpus was derived by semi-automatic means from the Penn Treebank; only sentences without conjunction were available (mean length=20, max=68). A randomly selected set of 400 sentences was set aside for testing all models; the rest were used to estimate the model parameters. In the pilot (unlike the full experiment), the parser was instructed to &amp;quot;back oil&amp;quot;' from all probabilities with denominators &lt; 10. For this reason, the models were insensitive to most lexical distinctions.</Paragraph>
    <Paragraph position="1"> In addition to models A, B, and C, described above, the pilot experiment evaluated two other models for comparison. Model C' was a version of model C that ignored lexical dependencies between parents and children, considering only dependencies between a parent's tag and a child's tag. This model is similar to the model nsed by stochastic CFG. Model X did the same n-gram tagging as models A and B (~. = 2 for the preliminary experiment, rather than n = 3), but did not assign any links.</Paragraph>
    <Paragraph position="2"> Tables 1 -2 show the percentage of raw tokens that were correctly tagged by each model, as well as the proportion that were correctly attached to  contage of tokens corrc0Lly attached Lo their paronl;s by each model.</Paragraph>
    <Paragraph position="3"> their parents. Per tagging, baseline per\[ol:lnance Wa, S I/leaSlli'ed by assigniug each word ill the test set its most frequent tag (i\[' any) \['roiii the trainlug set. Thc iinusually low I)aseliue t)crJ'orillance I:esults \['l'Olll kL conil)iuation of ;t sHiaJl l&gt;ilot Lr;~illing set and ;t Inil(lly (~xten(|e(I t~g set. 7 \Vc ol) served that hi the ka.ining set, detei:lniners n-lost colrinlonly pointed t.o the following; word, so as a parsing baseline, we linked every test dctcrnihler to the following word; likewise, wc linked every test prcpositior, to the preceding word, and so ()11, The l'JatterllS in the preliuli/lary data ~ti'e striking, with w:rbs showing up as all aFea el (lil\[iculty, alld with SOllle \]tlodcis cl&lt;;arly farillg bctter I,\[I;tll other. The siinplcst and \['astest uiodel, the l'(~cur-siw ~, generation uiodel (7, did easily i.he bcsl. ,job of &lt;'i-q)turing the dependency s/.ructurc ('l'able 2). It misattachcd t.hc fewest words, both overall aud in each categol:y. This suggcsts that sut)eategjo rization 1)rcferc\[lccs the only I'~Lctor ('onsidered by model (J I)lay a substantial role in I;he sti:uclure of Trcebank scntcn(-cs. (lndccd, tii(; erl;ors ill model I~, wliich pe:l:forHled worst across the bO~Lr(l, were very frequently arity erl:ors, where ttie desire of a chihl to ~Ltta(:h LO a 1)articular parent over-.</Paragraph>
    <Paragraph position="4"> calne the rchi(:i;ail(;e of tile \[)areiit to a(:(-el)t uiore children.) A good deal of the l,arsi0_g Sll(;(',ess of inoclel (7 seems to h~ve arisen from its k/iowle(lgc, of individ-tiff. words, as we cxpe(:ted. This is showfi by the vastly inl~rior l)Cl;forniaH('e o\[' I;}lc control, model (ft. On l;he ot\]ier hand, I)oth (7 an(l (J' were conlpetitivc with t\[10 oth0r UlOdCiS i~l; tagging. This shows that a t~Lg can 1)e predicted ~d)out as well \['rolri Lhe tags of its putative p;Lrel,t ;rod sil)\]in&lt;g as it ('an fiX)ill the \[~ags O\[&amp;quot; string-a(lja(:cnt words, eVell when there is ('onsideral)le e/;l:OF ill dcterinin-ing the parent and s\[bling.</Paragraph>
  </Section>
class="xml-element"></Paper>