<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2099">
  <Title>A Statistical Theory of Dependency Syntax</Title>
  <Section position="3" start_page="0" end_page="684" type="metho">
    <SectionTitle>
2 Generating Dependency Trees
</SectionTitle>
    <Paragraph position="0"> To describe a tree structure T, we will use a string notation, introduced in (Gorn, 1962), for the nodes  of Clio tree, where the node name sl)ecifi0s the path fi'om the root no(le ~ to the node in question, If (Jj is a node of the tree T, with j C N+ and (/J E N~, then q5 is also a node of the trc'e T and ()j is a child of 4.</Paragraph>
    <Paragraph position="1"> ltere, N+ denotes the set of positive integers {1,2,...} and N~_ is the set of strings over N+. 'l'his lncans that the label assigned to node ()j is a dependent of the label assigned to node (J. The first dependency tree of Figure 1 is shown in l!'igure 2 using this notation.</Paragraph>
    <Paragraph position="2"> We introduce three basic random variables, which incrementally generate the tree strucl;ure: * PS(4)) = l assigns the \]al)el l to node 4), where l is a iitlc|etis, i.e., it; is drawn frol,-I the set of strings over t, he so.t of i, okens.</Paragraph>
    <Paragraph position="3"> * &amp;quot;D(Oj) = d indicates t.h(~ dep(:ndency type d linking the label of node OJ to its regent, the label Of node 4'&amp;quot; * V(C/,) = v indica.tes that node 7, has exactly v child nodes.</Paragraph>
    <Paragraph position="4"> Note the use of ~(~/J) = 0, rather than a partitioning of the labels into terlninal and nonterminal syml)ols, to indicate that ~ is a leaf node.</Paragraph>
    <Paragraph position="5"> l,et D be the (finite) set of l)ossible dependency types. We next introduce the composite variables .T(()) ranging over the power bag* N D, indicating the bag of dependency types of dJ's children:</Paragraph>
    <Paragraph position="7"> Figure 3 encodes the dependency ti;ee ()1' Figure 2 accordingly. We will ignore the last cohunn \['or now.</Paragraph>
    <Paragraph position="8"> 1 A bag (mull'set) can contain several tokens of the Smlm type. We denote sets {...}, \]Jags \[...\] and ordered tuples {...), \]Jill, over\]o+'id O~ (~&gt; etc, We introduce the probabilities</Paragraph>
    <Paragraph position="10"> These l~robabilities are tyl)ically model parameters, o1' further decomposed into such. lJ~(@j) is the probability of the label PS(4~J) of a node given the label PS(4') of its regent and the dependency type &amp;quot;D(0j) linking them. l{eh~ting Eft, j) and PS(0) yiekls lexical collocation statistics and including D((~j) makes the collocation statistics lexical-fimetional. Pm(~0 is the probability of the bag of' dependency types Y(0) of a ,,ode Rive,, its label PS(4J) and its relation D(#)) to its regent. This retleets the probability of the label's vM oncy, or lexieal-fimctional eoml)lement , and of op1.ional adjuncts. Including D(q)) makes this probability situa.ted in taking its current role into accounl.. These allow us to define the tree probal)ility</Paragraph>
    <Paragraph position="12"> wiiere the 1)roduct is taken over the set. of nodes .A/&amp;quot; of the tree.</Paragraph>
    <Paragraph position="13"> \Y=e generate the random variables PS and S using a top-down stochastic process, where PS(()) is gonerated I)efore Y(O). The probal)ility of the conditioning material of l~(Oj) is then known from Pc(O) and 19((,), and that of Sg(4,j) is known froln \]'PS(OJ) and lJ:n(O). F'igure 3 shows the process generating the dependency tree' of Figure 2 by reading the PS and .7:- colunms downwards in parallel, PS before Y:</Paragraph>
    <Paragraph position="15"/>
  </Section>
  <Section position="4" start_page="684" end_page="686" type="metho">
    <SectionTitle>
3 String Realization
</SectionTitle>
    <Paragraph position="0"> '\]'he string realization cannot be uniquely determined from the tree structure. 'lb model the string-realization process, we introduce another fundamental random w~riable $(()), which denotes the string  associated with node 0 and which should not be confused with the node label PS(()). We will introduce yet another fundamental randoln variable \]v4(~)) in Section 3.2, when we accommodate crossing dependency links. In Section 3.1, we present a projectivc stochastic dependency gralnlnar with an expressive power not exceeding that of stochastic context-free grammars.</Paragraph>
    <Section position="1" start_page="685" end_page="685" type="sub_section">
      <SectionTitle>
3.1 Projective Dependency Grammars
</SectionTitle>
      <Paragraph position="0"> We let the stochastic process generating the PS and .7vtu'iM)les be as described above. We then define tile stochastic string-realization process by letting tile 8(~5) variables, given C/'s label 1(40 and the bag of strings s(()j) of ~5's child nodes, randomly permute and concatenate them according to the probability distributions of the modeh</Paragraph>
      <Paragraph position="2"> The latter equations should be interpreted as defining the randorn variable 8, rather than specifying its probability distribution or some possible outcome.</Paragraph>
      <Paragraph position="3"> This means that each dependent is realized adjacent to its regent, where wc allow intervening siblings, and that we thus stay within the expressive power of stochastic context-free grammars.</Paragraph>
      <Paragraph position="4"> We define the string-realization probability ~beAr and the tree-string probability as</Paragraph>
      <Paragraph position="6"> The stochastic process generating the tree structure is as described above. We then generate the string variables S using a bottom-up stochastic process. Figure 3 also shows the process realizing the surface string John ate beans fl-om the dependency tree of Figure 2 by reading the S column upwards:</Paragraph>
      <Paragraph position="8"> Consider cMeulating tile striug probability at node  1. Ps is the probability of the particular permut~ttion observed of the strings of the children and the  say lhat John ate? 1M)el of the node. To overcome the sparse-data problem, we will generalize over the actual strings of tile children to their dependency types. For example, s(subj) denotes the string of the subject child, regardless of what it actually might be.</Paragraph>
      <Paragraph position="10"> This is the probability of the permutation (s(subj), ate, s(dobj)) of the bag \[s(subj), aic, s(dobj)\] given this bag and the fact that we wish to tbrm a main, declarative clause. This example highlights the relationship between the node strings and both Sallssure's notion of constituency and tile l)ositiolml schemata of, amongst others, l)idrichsen.</Paragraph>
    </Section>
    <Section position="2" start_page="685" end_page="686" type="sub_section">
      <SectionTitle>
3.2 Crossing Dependency Links
</SectionTitle>
      <Paragraph position="0"> To accommodate long-distance dependencies, we allow a dependent to be realized adjacent to the label of rely node that dominates it, immediately or not. For example, consider the dependency tree of Figure 4 tbr the sentence l/Vhat beans did Ma'Jw say that John ate? as encoded in Figure 5. Ilere, What beans is a dependent of that arc, which in turn is a dependent of did say, and What beans is realized between did and sag. This phenomenon is called movement in conjunction with phrase-structure gramm~rs. It makes the dependency grammar nonprojective, since it creates crossing dependency links if the dependency trees also depict the word order.</Paragraph>
      <Paragraph position="1"> We introduce variables A//(~) that randomly select from C(4)) a, subbag CM(4,) of strings passed up to ()'s regent:</Paragraph>
      <Paragraph position="3"> .sag that dohn ate? q'he rest of the strings, Cs(C/), are realized here:</Paragraph>
      <Paragraph position="5"/>
    </Section>
  </Section>
  <Section position="5" start_page="686" end_page="687" type="metho">
    <SectionTitle>
3.3 Discontinuous Nuclei
</SectionTitle>
    <Paragraph position="0"> We generalize the scheme to discontinuous nuclei by allowing 8(C/) to i,mert the strings of C~.(~5) anywhere in 1(C/): e</Paragraph>
    <Paragraph position="2"> Tllis means that strings can only l)e inserted into ancestor labels, ,lot into other strings, which enforces a. type of reverse islaml constraint. Note how in Figure 6 John is inserted between that and ate to form the subordina, te clause that John atc.</Paragraph>
    <Paragraph position="3"> We define tile string-realization probability ,b6 ar and again define the tree-string prol)ability</Paragraph>
    <Paragraph position="5"> mutation, q~snihre's original implicit definition of a nucleus actually does not require that the order be preserved when realizing it; if has catch is a nucleus, so is eaten h.as. This is obviously a useflfl feature for nlodeling verb chains in G erln&amp;n subordinate clauses.</Paragraph>
    <Paragraph position="6"> 'lb avoid derivational ambiguity when generating a tree-string pair, i.e., have more than one derivation generate tile same tree-string pair, we require that no string be realized adjacent to the string of any node it was passed u 1) through. This introduces the l)raetica.l problem of ensuring that zero probability mass is assigned to all derivations violating this constraint. Otherwise, the result will be approxima.ting the parse probabi\]ity with a derivation probability, as described in detail in (Samuelsson, 2000) based on the seminal work of (Sima'an, 1996). Schemes like (Alshawi, 1996) tacitly make this approximation.</Paragraph>
    <Paragraph position="7"> The tree-structure variables PS and be are generated just as before. ~Y=e then generate the string variables 8 and Ad using a bottom-up stochastic process, where M(C/)is generated before 8(C/). 'l.'he probability of the eonditkming material of \]o~ (C/) is then known either from the top-down process or from I'M(C/j) and Pa(C/j), and that of INTO)is known either from the top-down process, or from 15v4(C/), \[)dgq(4)j) and 1~(C/j). The coherence of S(~) a.nd f14(~/)) is enforced by explicit conditioning.</Paragraph>
    <Paragraph position="8"> Figure 5 shows a top-down process generating the dependency tree of Figure &lt;1; the columns PS and be should be read downwards in parallel, L; before b e. Figure 6 shows a bottom-up process generating the string l/Vhat beans did Mary say that dohn at(:? from the dependency description of Figure 5. The colltlll,lS d~v4 and S should be read upwards in parallel, 2t4 before $.</Paragraph>
    <Section position="1" start_page="686" end_page="687" type="sub_section">
      <SectionTitle>
3.4 String Merging
</SectionTitle>
      <Paragraph position="0"> We have increased the expressive power of our dependency gramma.rs by nlodifying tile S variables, i.e., by extending the adjoin opera.lion. In tile first version, the adjoin operation randomly permutes the node label and the strings of the child nodes, and concatenates the result. In the second version, it randondy inserts the strings of the child nodes, and any moved strings to be rea.lized at tile current node, into the node label.</Paragraph>
      <Paragraph position="1"> The adjoin operation can be fln:ther refined to allow handling an even wider range of phenomena, such as negation in French. Here, the dependent string is merged with the label of the regent, as ne ... pas is wrapped around portions of the verb phrase, e.g., Ne me quitte pas!, see (Brel, 195.(t). Figure 7 shows a dependency tree h)r this. In addition to this, the node labels may be linguistic abstractions, e.g.</Paragraph>
      <Paragraph position="2"> &amp;quot;negation&amp;quot;, calling on the S variables also for their surface-string realization.</Paragraph>
      <Paragraph position="3"> Note that the expressive power of the grammar depends on the possible distributions of the string probabilities IN. Since each node label can be moved and realized at the root node, any language can be recognized to which the string probabilities allow assigning the entire probablity mass, and the gralnmar will possess at least this expressive power.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="687" end_page="689" type="metho">
    <SectionTitle>
4 A Computational Rendering
</SectionTitle>
    <Paragraph position="0"> A close approximation of the described stochastic model of dependency syntax has been realized as a type of prohabilistic bottom-up chart parser.</Paragraph>
    <Section position="1" start_page="687" end_page="687" type="sub_section">
      <SectionTitle>
4.1 Model Specialization
</SectionTitle>
      <Paragraph position="0"> The following modifications, which are really just specializations, were made to the proposed model for efficiency reasons and to cope with sparse data.</Paragraph>
      <Paragraph position="1"> According to Tesni6re, a nucleus is a unit that contains both tile syntactic and semantic head and that does not exhihit any internal syntactic structure. We take the view that a nucleus consists of a content word, i.e., an open-class word, and all flmction words adding information to it that could just as well have been realized morphologically. For example, the definite article associates definiteness with a. word, which conld just has well have been manifested in the word form, as it is done in North-Germanic languages; a preposition could be realized as a loca.tional or temporal inflection, as is done in Finnish. The longest nuclei we currently allow are verb chains of the form that have been eoten, as in John knows that lhe beans have been eaten.</Paragraph>
      <Paragraph position="2"> The 5 r variables were decomposed into generating the set of obligatory arguments, i.e., the valency or lexical complement, at once, as in the original model.</Paragraph>
      <Paragraph position="3"> Optional modifiers (adjuncts) are attached through one memory-less process tbr each modifier type, resuiting in geometric distributions for these. This is the same separation of arguments and adjuncts as that employed by (Collins, 1997). However, the PS variables remained as described above, thus leaving the lexieal collocation statistics intact.</Paragraph>
      <Paragraph position="4"> The movement probability was divided into three parts: the probability of moving the string of a particular argument dependent from its regent, that of a moved dependency type passing through a particular other dependency type, and that of a dependency type landing beneath a particular other dependency type. The one type of movement that is not yet properly handled is assigning arguments and adjuncts to dislocated heads, as in What book did John read by Chomsky? The string-realization probability is a straight-forward generalization of that given at the end of Section 3.1, and they m:e defined through regular expressions. Basically, each unmoved dependent string, each moved string landed at. the cur- null and Did John eat xxm? rent node, and each token of the nucleus labeling the current node are treated as units that are randomly permuted. Whenever possible, strings are generalized to their dependency types, but accurately modelling dependent order in French requires inspecting tile actual strings of dependent clitics. Open-class words are typically generalized to their word class.</Paragraph>
      <Paragraph position="5"> String merging only applies to a small class of nuclei, where we treat tile individual tokens of the dependent string, which is typically its label, as separate units when perfornfing tile permutation.</Paragraph>
    </Section>
    <Section position="2" start_page="687" end_page="688" type="sub_section">
      <SectionTitle>
4.2 The Chart Parser
</SectionTitle>
      <Paragraph position="0"> The parsing algorithm, which draws on the Co&amp;et(asanli-Younger (CI(Y) algorithm, see (Younger, 1967), is formulated as a prohabilistic deduction scheme, which in turn is realized as an agenda-driven chart-pa.rser. The top-level control is similar to that of (Pereira and Shieher, 1987), pp. 196-210. The parser is implemented in Prolog, and it relies heavily on using set and bag operations as primitives, utilizing and extending existing SICStus libraries.</Paragraph>
      <Paragraph position="1"> The parser first nondeterministically segments the input string into nuclei, using a lexicon, and each possible lmcleus spawns edges tbr the initial chart.</Paragraph>
      <Paragraph position="2"> Due to discontinuous nuclei, each edge spans not a single pair of string positions, indicating its start and end position, \])tit a set of such string-position pairs, and we call this set an index. If the index is a singleton set, then it is continuous. We extend the notion of adjacent indices to be any two non-overlapping indices where one has a start position that equals an end position of the other.</Paragraph>
      <Paragraph position="3"> The lexicon contains intbrmation about the roles (dependency types linking it to its regent) and valencies (sets s of types of argument dependents) that are possible for each nucleus. These are hard constraints. Unknown words are included in nuclei in a judicious way and the resulting nuclei are assigned all reasonable role/valency pairs in the lexicon. For example, the parser &amp;quot;correctly&amp;quot; analyzes tile sentences Did John xxx beans? and Did John eat xxx? as shown in Figure 8, where xxx' is not in the lexicon.</Paragraph>
      <Paragraph position="4"> For each edge added to the initial chart, the lexicon predicts a single valency, but a set of alternative roles. Edges arc added to cover all possible valenal)ue to the uniqueness principle of arguments, these are sets, rather than bags.</Paragraph>
      <Paragraph position="5">  ties for each nucleus. The roles correspond to tim &amp;quot;goal&amp;quot; of dotted items used ill traditional cha.rt parsing, and the unfilled valency slots play the part of the &amp;quot;l)ody&amp;quot;, i.e., the i)ortion of the \]{IlS \['ol\]owing the dot that renlailis to I)e found. If an argunl_ent is attached to the edge, the corresponding valency slot is filled in the resulting new odg(;; no arg~llnlont ea.ll be atta.ched to a.n edge llnless tllere is a (;orrespon(ling unfilled witency slot for it, or it is licensed by a lnoved arguln0nt, l,'or obvions reasons, the lexicon ca.nnot predict all possible combinations of adjuncts for each nuehms, and in fact predicts none at all.</Paragraph>
      <Paragraph position="6"> There will in general be nmltiple derivations of any edge with more than ()no del)endent , but the parser avoids adding dul)licate edges to tlt(? chart in the same way as a. traditional chart l)arser does.</Paragraph>
      <Paragraph position="7"> The l&gt;arser enll)loys a. l)a(:ked l)arse foresl. (PI)I! ') to represent the set of all possible analyses and the i)robalfility of each analysis is recoverable I\]:om the PPI!' entries. Since optional inodifiers are not 1)re dieted by the lexicon, the chart does not (:onl, a.ii~ any edges that ('orrespon(t directly to passive edges ill traditional chart parsing; at any point, an ad.lun('t C~ll always be added to an existing edge to form a new edge. In sonic sense, though, tile 1)1)1 '' nodes play tlie role all' passive edges, since the l)arser never attempts to combine two edges, only Olle ('xlgc and one I)l)l! ' lio(le, and the la.tter will a.lways 1)e a. dependent el'the fornier, directly, or indirectly tlirough the lists of n:iovcd dependents. 'l'he edge and l)l)l i' node to be Colnl)ined ai'e required to \]lave adjacent indices, and their lnlion is the index of tile now edge.</Paragraph>
      <Paragraph position="8"> The lnain point in using a l)acked parse forest, is to po'rI'orni local ~tiiil)iguity packing, which lneans a.b stracting over difl);ren(-es ill intc'rnal stlFlletlllye that do not lllalL, t(;r for fllrth(~,r \])arsilig. \Y=hen attching a I)PF no(-l(~' to SOlllo edgc ;_is a direct or indirect dependent, the only relewuit teatnres are its index, its nucleus, its role a.nd its moved dependents. Oilier features necessary for recovering the comph;tc&amp;quot; analysis are recorded in the P1)F entries of the node, bnt arc not used for parsing.</Paragraph>
      <Paragraph position="9"> To indicate the alternative that no more dependents are added to an edge, it is converted into a set of PPF updates, where each alternative role of the edge adds or updates one PPF entry. When doing this, any unfilled valency slots are added to the edge's set of moved arguments, which in turn is inherited by the resulting PPF update. '.\['lie edges are actually not assigned probabilities, since they contain enough information to derive the appropriate l)robabilities once they are converted into I)I)F entries. '1'o avoid the combinatorial explosion el' unrestricted string merging, we only allow edges with continuous indices to be converted into PI)I! ' 0ntries, with the exception of a very limited class of lexically signMed nnelei, snch as the nc pas, nc jamais, etc., scheme of French negation.</Paragraph>
    </Section>
    <Section position="3" start_page="688" end_page="689" type="sub_section">
      <SectionTitle>
4.3 Pruning
</SectionTitle>
      <Paragraph position="0"> As Ot)l)osed to traditional chart parsing, meaningful upper and lower 1)ounds of the supply and demand for the dependency types C&amp;quot; the &amp;quot;goal&amp;quot; (roles) and &amp;quot;body&amp;quot; (wdency) of each edge can 1)e determined From the initial chart, which allows performing sophis(,icated pruning. The basic idea is that if some edge is proposed with a role that is not sought outside its index, this role can safely be removed. For example, the word me could potentially be an indirect object, but if there is no other word in the inl)ut string that can have a.n indirect object as an argument, this alternative can be discarded.</Paragraph>
      <Paragraph position="1"> 'Phis idea is generalized to a varia.nt of pigeonhole reasoning, in the v(;in of If wc select this role or edge, then ~here are by necessity too few or too many of some del)endcncy tyl)e sought or Cl'ered in the chart.</Paragraph>
      <Paragraph position="2"> or alternatively If wc select this nucleus or edge, then we cannot span the entire input string.</Paragraph>
      <Paragraph position="3"> Pruning is currently only al)plied to the initial chart to remove logically inq&gt;ossible alternatives and used to filter out impossible edges produced in the prediction step. Nonetheless, it reduces parsing times by an order of magnitude or more tbr many of the test examples. \]t would however be possible to apply similar ideas to interniittently reinove alternatives that are known 1:o be suboptimal, or to \]leuristically prtllie unlik(;ly searcll branches.</Paragraph>
      <Paragraph position="4">  We have proposed a generative, statistical t.iieory of dependency syntax, based on TesniSrc's classical theory, that models crossing dependency links, discontinuous nuclei and string merging. The key insight was to separate the tree-generation and string-realization processes. The model has been realized as a type of probabilistie chart parser. The only other high-fidelity computational rendering of Tesnitre's dependency syntax that we are aware of is that of (rl.'apanainen and J fi.rvinen, 1997), which is neither generative nor statistical.</Paragraph>
      <Paragraph position="5"> The stochastic model generating dependency trees is very similar to other statistical dependency models, e.g., to that of (Alshawi, 1996). Formulating it using Gorn's notation and the L; and 2&amp;quot; variables, though, is concise, elegant; and novel. Nothing prevents conditioning the random variables on arbitrary portions of Clio 1)artial tree generated this far, using, e.g., maximum-entrol)y or decision-tree models to extract relevant t~atnres of it; there is no difference  in principle between our model and history-based parsing, see (Black el; al., 1993; Magerman, 1995).</Paragraph>
      <Paragraph position="6"> The proposed treatment of string realization through the use of the ,5 and A4 variables is also both truly novel and important. While phrase-structure grammars overemphasize word order by making the processes generating the S variables deterministic, Tesni6re treats string realization as a secondary issue. We tind a middle ground by nsing stochastic processes to generate the S and Ad variables, thus reinstating word order as a parameter of equal importance as, say, lexical collocation statistics. It is however not elevated to the hard-constraint status it enjoys in phrase-structure grammars.</Paragraph>
      <Paragraph position="7"> Due to the subordinate role of string realization in classical dependency grammar, the technical problems related to incorporating movement into the string-realization process have not been investigated in great detail. Our use of the 54 variables is motivated partly by practical considerations, and partly by linguistic ones. The former in the sense that this allows designing efficient parsing algorithms for handling also crossing dependency links. The latter as this gives us a quantitative handle on the empirically observed resistance against crossing dependency links. As TesniSre points out, there is locality in string realization in the sense that dependents tend to be realized adjacent to their regents. This fact is reflected by the model parameters, which also model, probabilistically, barrier constraints, constraints on landing sites, etc. It is noteworthy that treating movelnent as in GPSG, with the use of the &amp;quot;slash&amp;quot; l~ature, see (Gazdar et al., 1985), pp. 137-168, or as is done in (Collins, \]997), is the converse of that proposed here for dependency grammars: the tbrmer pass constituents down the tree, the 54 variables pass strings up the tree.</Paragraph>
      <Paragraph position="8"> The relationship between the proposed stochastic model of dependency syntax and a number of other prominent stochastic grammars is explored in detail in (Samuelsson, 2000).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>