<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1036"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 281-290, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics. Compiling Comp Ling: Practical Weighted Dynamic Programming and the Dyna Language</Title> <Section position="3" start_page="281" end_page="281" type="metho"> <SectionTitle> 2 A Language for Deductive Systems </SectionTitle> <Paragraph position="0"> Any toolkit needs an interface. For example, FS toolkits offer a regular expression language. We propose a simple but Turing-complete language, Dyna, for specifying weighted deductive-inference algorithms. We illustrate it here by example; see http://dyna.org for more details and a tutorial.</Paragraph> <Paragraph position="1"> The short Dyna program in Fig. 1 expresses the inside algorithm for PCFGs (i.e., the probabilistic generalization of CKY recognition). Its 3 inference rules schematically specify many equations, over an arbitrary number of unknowns. This is possible because the unknowns (items) have structured names (terms) such as constit(&quot;s&quot;,0,3). They resemble typed variables in a C program, but we use variable instead to refer to the capitalized identifiers X, I, K, . . . in lines 2-4. Each rule gives a consequent on the left-hand side of the +=, which can be built by combining the antecedents on the right-hand side.[1] Lines 2-4 are equational schemas that specify how to compute the value of items such as constit(&quot;s&quot;,0,3) from the values of other items. Using the summation operator +=, lines 2-3 say that for any X, I, and K, constit(X,I,K) is defined by summing over the remaining variables, as $\sum_{W}$ rewrite(X,W)*word(W,I,K) + $\sum_{Y,Z,J}$ rewrite(X,Y,Z)*constit(Y,I,J)*constit(Z,J,K). 
For example, constit(&quot;s&quot;,0,3) is a sum of quantities such as rewrite(&quot;s&quot;, &quot;np&quot;, &quot;vp&quot;)*constit(&quot;np&quot;,0,1)*constit(&quot;vp&quot;,1,3). The whenever operator in line 4 specifies a side condition that restricts the set of expressions in the sum (i.e., only when N is the sentence length).</Paragraph> <Paragraph position="2"> To fully define the system of equations, non-default values (in this case, non-zero values) should be asserted for some axioms at runtime. (Axioms, shown in bold in Fig. 1, are items that never appear as a consequent.) Footnote 1: Much of our notation and terminology comes from logic programming: term, variable, inference rule, antecedent/consequent, assert/retract, axiom/theorem.</Paragraph> <Paragraph position="3"> If the PCFG contains a rewrite rule np $\to$ Mary with probability p(Mary | np) = 0.005, the user should assert that rewrite(&quot;np&quot;, &quot;Mary&quot;) has value 0.005. If the input is John loves Mary, values of 1 should be asserted for word(&quot;John&quot;,0,1), word(&quot;loves&quot;,1,2), word(&quot;Mary&quot;,2,3), and ends_at(3). Given the axioms as base cases, the equations in Fig. 1 enable deduction of values for other items.</Paragraph> <Paragraph position="4"> The value of the theorem constit(&quot;s&quot;,0,3) will be the inside probability $\beta_s(0,3)$,[2] and the value of goal will be the total probability of all parses.</Paragraph> <Paragraph position="5"> If one replaces += by max= throughout, then constit(&quot;s&quot;,0,3) will accumulate the maximum rather than the sum of these quantities, and goal will accumulate the probability of the best parse.</Paragraph> <Paragraph position="6"> With different input, the same program carries out lattice parsing. Simply assert axioms that correspond to (weighted) lattice arcs, such as word(&quot;John&quot;,17,50), where 17 and 50 are arbitrary terms denoting states in the lattice. 
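The equations of Fig. 1 can also be evaluated by hand-written code. The following is a minimal Python sketch of that computation, not Dyna's own solver: it assumes a grammar in Chomsky normal form with no unary nonterminal cycles, and it simply processes narrower spans before wider ones. The function name inside, the dictionary encoding of the rewrite axioms, and every grammar weight other than rewrite("np","Mary") = 0.005 are assumptions made for illustration.

```python
from collections import defaultdict


def inside(words, unary, binary, start="s"):
    """Return (goal, constit) for a PCFG in Chomsky normal form.

    unary maps (X, W) to the value of the axiom rewrite(X,W);
    binary maps (X, Y, Z) to the value of the axiom rewrite(X,Y,Z).
    """
    n = len(words)
    constit = defaultdict(float)  # unasserted items default to 0
    # Line 2 of Fig. 1: constit(X,I,K) += rewrite(X,W) * word(W,I,K).
    for i, w in enumerate(words):
        for (x, w2), p in unary.items():
            if w2 == w:
                constit[x, i, i + 1] += p  # word(W,I,I+1) has value 1
    # Line 3 of Fig. 1, visiting narrower spans before wider ones.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (x, y, z), p in binary.items():
                    constit[x, i, k] += p * constit[y, i, j] * constit[z, j, k]
    # Line 4 of Fig. 1: goal += constit("s",0,N) whenever ?ends_at(N).
    return constit[start, 0, n], constit


# Hypothetical toy grammar; only the np/Mary weight comes from the text.
unary = {("np", "John"): 0.01, ("np", "Mary"): 0.005, ("v", "loves"): 0.1}
binary = {("s", "np", "vp"): 1.0, ("vp", "v", "np"): 1.0}
goal, constit = inside(["John", "loves", "Mary"], unary, binary)
```

With this toy grammar the sentence has a single parse, so goal is the product 0.01 * 0.1 * 0.005 of the rule probabilities used in it. Replacing the two += accumulations by a max would give the probability of the best parse instead, mirroring the max= variant described above.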
It is also quite straightforward to lexicalize the nonterminals or extend to synchronous grammars.</Paragraph> <Paragraph position="7"> A related context-free parsing strategy, shown in Fig. 2, is Earley's algorithm. These equations illustrate nested terms such as lists. The side condition in line 2 prevents building any constituent until one has built a left context that calls for it.</Paragraph> </Section> <Section position="4" start_page="281" end_page="283" type="metho"> <SectionTitle> 3 Relation to Previous Work </SectionTitle> <Paragraph position="0"> There is a large relevant literature. Some of the well-known CL papers, notably Goodman (1999), were already mentioned in section 1.1. Our project has three main points of difference from these.</Paragraph> <Paragraph position="1"> First, we provide an efficient, scalable, open-source implementation, in the form of a compiler from Dyna to C++ classes. (Related work is in §7.2.) The C++ classes are efficient and easy to use, with statements such as c[rewrite(&quot;np&quot;,&quot;Mary&quot;)]=0.005 to assert axiom values into a chart named c (i.e., a deductive database) and expressions like c[goal] to extract the values of the resulting theorems, which are computed as needed. The C++ classes also give access to the proof forest (e.g., the forest of parse trees), and integrate with parameter optimization code.
Footnote 2: That is, the probability that s would stochastically rewrite to the first three words of the input. If this can happen in more than one way, the probability sums over multiple derivations.
Figure 2 (program):
1. need(&quot;s&quot;,0) = 1. % begin by looking for an s that starts at position 0
2. constit(Nonterm/Needed,I,I) += rewrite(Nonterm,Needed) whenever ?need(Nonterm, I). % traditional predict step
3. constit(Nonterm/Needed,I,K) += constit(Nonterm/cons(W,Needed),I,J) * word(W,J,K). % traditional scan step
4. constit(Nonterm/Needed,I,K) += constit(Nonterm/cons(X,Needed),I,J) * constit(X/nil,J,K). % traditional complete step
5. goal += constit(&quot;s&quot;/nil,0,N) whenever ?ends_at(N). % we want a complete s constituent covering the sentence
6. need(Nonterm,J) += constit(_/cons(Nonterm,_),_,J). % Note: underscore matches anything (anonymous wildcard)
Figure 2 (caption, beginning lost): ... as the axiom rewrite(&quot;np&quot;,cons(&quot;det&quot;,cons(&quot;n&quot;,nil))), a nested term. &quot;np&quot;/Needed is the label of a partial np constituent that is still missing the list of subconstituents in Needed. need(&quot;np&quot;,3) is derived if some partial constituent seeks an np subconstituent starting at position 3. As in Fig. 1, lattice parsing comes for free, as does training.</Paragraph> <Paragraph position="2"> Second, we fully generalize the agenda-based strategy of Shieber et al. (1995) to the weighted case, in particular supporting a prioritized agenda.</Paragraph> <Paragraph position="3"> That allows probabilities to guide the search for the best parse(s), a crucial technique in state-of-the-art context-free parsers.[3] We also give a &quot;reverse&quot; agenda algorithm to compute gradients or outside probabilities for parameter estimation.</Paragraph> <Paragraph position="4"> Third, regarding weights, the Dyna language is designed to express systems of arbitrary, heterogeneous equations over item values. In previous work such as (Goodman, 1999; Nederhof, 2003), one only specifies the inference rules as unweighted Horn clauses, and then weights are added automatically in a standard way: all values have the same type W, and all rules transform to equations of the form $c \oplus= a_1 \otimes a_2 \otimes \cdots \otimes a_k$, where $\oplus$ and $\otimes$ give W the structure of a semiring.[4] In Dyna one writes these equations explicitly in place of Horn clauses (Fig. 1). 
Accordingly, heterogeneous Dyna programs, to be supported soon by our compiler, will allow items of different types to have values of different types, computed by different aggregation operations over arbitrary right-hand-side expressions. This allows specification of a wider class of algorithms from NLP and elsewhere (e.g., minimum expected loss decoding, smoothing formulas, neural networks, game tree analysis, and constraint programming). Although §4 and §5 have space to present only techniques for the semiring case, these can be generalized.</Paragraph> <Paragraph position="5"> Footnote 3: Previous treatments of weighted deduction have used an agenda only for an unweighted parsing phase (Goodman, 1999) or for finding the single best parse (Nederhof, 2003). Our algorithm works in arbitrary semirings, including non-idempotent ones, taking care to avoid double-counting of weights and to handle side conditions.</Paragraph> <Paragraph position="6"> Footnote 4: E.g., the inside algorithm in Fig. 1 falls into Goodman's framework, with $\langle W,\oplus,\otimes\rangle = \langle \mathbb{R}_{\geq 0},+,\times\rangle$, the PLUSTIMES semiring. Because $\otimes$ distributes over $\oplus$ in a semiring, computing goal is equivalent to an aggregation over many separate parse trees. That is not the case for heterogeneous programs.</Paragraph> <Paragraph position="7"> Our approach may be most closely related to deductive databases, which even in their heyday were apparently ignored by the CL community (except for Minnen, 1996). Deductive database systems permit inference rules that can derive new database facts from old ones.[5] They are essentially declarative logic programming languages (with restrictions or extensions) that are, or could be, implemented using efficient database techniques. Some implemented deductive databases such as CORAL (Ramakrishnan et al., 1994) and LOLA (Zukowski and Freitag, 1997) support aggregation (as in Dyna's +=, log+=, max=, . . . 
), although only &quot;stratified&quot; forms of it that exclude unary CFG rule cycles.6 Ross and Sagiv (1992) (and in a more restricted way, Kifer and Subrahmanian, 1992) come closest to our notion of attaching aggregable values to terms.</Paragraph> <Paragraph position="8"> Among deductive or other database systems, Dyna is perhaps unusual in that its goal is not to support transactional databases or ad hoc queries, but rather to serve as an abstract layer for specifying an algorithm, such as a dynamic programming (DP) algorithm. Thus, the Dyna program already implicitly or explicitly specifies all queries that will be needed.</Paragraph> <Paragraph position="9"> This allows compilation into a hard-coded C++ implementation. The compiler's job is to support these lations in memory7 in a way that resembles hand-designed data structures for the algorithm in question. The compiler has many choices to make here; we ultimately hope to implement feedback-directed optimization, using profiled sample runs on typical data. For example, a sparse grammar should lead to different strategies than a dense one.</Paragraph> </Section> <Section position="5" start_page="283" end_page="286" type="metho"> <SectionTitle> 4 Computing Theorem Values </SectionTitle> <Paragraph position="0"> Fig. 1 specifies a set of equations but not how to solve them. Any declarative specification language must be backed up by a solver for the class of specifiable problems. In our continuing work to develop a range of compiler strategies for arbitrary Dyna programs, we have been inspired by the CL community's experience in building efficient parsers.</Paragraph> <Paragraph position="1"> In this paper and in our current implementation, we give only the algorithms for what we call weighted dynamic programs, in which all axioms and theorems are variable-free. This means that a consequent may only contain variables that already appear elsewhere in the rule. 
We further restrict to semiring-weighted programs as in (Goodman, 1999). But with a few more tricks not given here, the algorithms can be generalized to a wider class of heterogeneous weighted logic programs.[8]</Paragraph> <Section position="1" start_page="283" end_page="283" type="sub_section"> <SectionTitle> 4.1 Desired properties </SectionTitle> <Paragraph position="0"> Computation is triggered when the user requests the value of one or more particular items, such as goal.</Paragraph> <Paragraph position="1"> Our algorithm must have several properties in order to substitute for manually written code.</Paragraph> <Paragraph position="2"> Soundness. The algorithm cannot be guaranteed to terminate (since it is possible to write arbitrary Turing machines in Dyna). However, if it does terminate, it should return values from a valid model of the program, i.e., values that simultaneously satisfy all the equations expressed by the program.</Paragraph> <Paragraph position="3"> Reasonable completeness. The computation should indeed terminate for programs of interest to the NLP community, such as parsing under a probabilistic grammar, even if the grammar has left recursion, unary rule cycles, or $\epsilon$-productions. This appears to rule out pure top-down (&quot;backward-chaining&quot;) approaches.</Paragraph> <Paragraph position="5"> Footnote 8 (beginning lost): ... dates, which arbitrarily modify one of the inputs to an aggregation. Non-dynamic programs require non-ground items in the chart, complicating both storage and queries against the chart.</Paragraph> <Paragraph position="6"> Figure 3 (algorithm):
1. for each axiom a, set agenda[a] := value of axiom a
2. while there is an item a with agenda[a] $\neq$ 0
3.   (* remove an item from the agenda and move its value to the chart *)
4.   choose such an a
5.   $\Delta$ := agenda[a]; agenda[a] := 0
6.   old := chart[a]; chart[a] := chart[a] $\oplus$ $\Delta$
7.   if chart[a] $\neq$ old (* only propagate actual changes *)
8.     (* compute new resulting updates and place them on the agenda *)
9.     for each inference rule &quot;$c \oplus= a_1 \otimes a_2 \otimes \cdots \otimes a_k$&quot;
10.      for i from 1 to k
11.        for each way of instantiating the rule's variables such that $a_i = a$
12.          agenda[c] $\oplus=$ $\bigotimes_{j=1}^{k}$ (old if $j &lt; i$ and $a_j = a$; $\Delta$ if $j = i$; chart[$a_j$] otherwise)
Figure 3 (caption, beginning lost): ... without side conditions (see text).</Paragraph> <Paragraph position="7"> Efficiency. Returning the value of goal should do only as much computation as necessary. To return goal, one may not need to compute the values of all items.[9] In particular, finding the best parse should not require finding all parses (in contrast to Goodman (1999) and Zhou and Sato (2003)). Approximation techniques such as pruning and best-first search must also be supported for practicality.</Paragraph> </Section> <Section position="2" start_page="283" end_page="284" type="sub_section"> <SectionTitle> 4.2 The agenda algorithm </SectionTitle> <Paragraph position="0"> Our basic algorithm (Fig. 3) is a weighted agenda-based algorithm that works only with rules of the form $c \oplus= a_1 \otimes a_2 \otimes \cdots \otimes a_k$. $\otimes$ must distribute over $\oplus$.</Paragraph> <Paragraph position="1"> Further, the default value for items (line 1 of Fig. 1) must be the semiring's zero element, denoted 0.[10] Agenda-based deduction maintains two indexed data structures: the agenda and the chart. chart[a] stores the current value of item a. The agenda holds future work that arises from assertions or from previous changes to the chart: agenda[a] stores an incremental update to be added (using $\oplus$) to chart[a] in future. If chart[a] or agenda[a] is not stored, it is taken to be the default 0.</Paragraph> <Paragraph position="2"> Footnote 9 (beginning lost): ... computation of goal to terminate even if the program as a whole contains some irrelevant non-terminating computation. Even in practical cases, the runtime of computing all items is often prohibitive, e.g., proportional to $n^6$ or worse for a dense tree-adjoining grammar or synchronous grammar.</Paragraph> <Paragraph position="3"> Footnote 10: It satisfies $x \oplus 0 = x$, $x \otimes 0 = 0$ for all x. Also, this algorithm requires $\otimes$ to distribute over $\oplus$. Dyna's semantics requires $\oplus$ to be associative and commutative.</Paragraph> <Paragraph position="4"> When item a is removed from the agenda, its chart weight is updated by the increment value. This change is then propagated to other items c, via rules of the form $c \oplus= \cdots$ with a on the right-hand side.</Paragraph> <Paragraph position="5"> The resulting changes to c are placed back on the agenda and carried out only later.</Paragraph> <Paragraph position="6"> The unweighted agenda-based algorithm (Shieber et al., 1995) may be regarded as the case where</Paragraph> <Paragraph position="8"> the natural further generalization to any semiring.</Paragraph> <Paragraph position="9"> How is this a further generalization? Since $\oplus$ (unlike $\vee$ and max) might not be idempotent, we must take care to avoid erroneous double-counting if the antecedent a combines with, or produces, another copy of itself.[11] For instance, if the input contains $\epsilon$ words, line 2 of Fig. 1 may get instantiated as con-</Paragraph> <Paragraph position="11"> constit(&quot;np&quot;,5,5). This is why we save the old values of agenda[a] and chart[a] as $\Delta$ and old, and why line 12 is complex.</Paragraph> </Section> <Section position="3" start_page="284" end_page="284" type="sub_section"> <SectionTitle> 4.3 Side conditions </SectionTitle> <Paragraph position="0"> We now extend Fig. 3 to handle Dyna's side conditions, i.e., rules of the form $c \oplus=$ expression whenever boolean-expression. We discuss only the simple side conditions treated in previous literature, which we write as $c \oplus= a_1 \otimes a_2 \otimes \cdots \otimes a_{k'}$ whenever $?b_{k'+1}$ &amp; $\cdots$ &amp; $?b_k$. Here, $?b_j$ is true or false according to whether there exists an unweighted proof of $b_j$.</Paragraph> <Paragraph position="1"> Again, what is new here? 
Nederhof (2003) considers only max= with a uniform-cost agenda discipline (see §4.5), which guarantees that no item will be removed more than once from the agenda. We wish to support other cases, so we must take care that a second update to $a_i$ will not retrigger rules of which $a_i$ is a side condition.</Paragraph> <Paragraph position="2"> For simplicity, let us reformulate the above rule as $c \oplus= a_1 \otimes a_2 \otimes \cdots \otimes a_{k'} \otimes ?b_{k'+1} \otimes \cdots \otimes ?b_k$, where $?b_i$ is now treated as having value 0 or 1 (the identity for $\otimes$) rather than false or true respectively. We may now use Fig. 3, but now any $a_j$ might have the form $?b_j$. Then in line 12, chart[$a_j$] will be chart[$?b_j$], which is defined as 1 or 0 according to whether chart[$b_j$] is stored (i.e., whether $b_j$ has been derived). Also, if $a_i = ?a$ at line 11 (rather than $a_i = a$), then $\Delta$ in line 12 is replaced by $\Delta_{?}$, where we have set $\Delta_{?}$ := chart[?a] at line 5.
Footnote 11: An agenda update that increases x by 0.3 will increase r * x * x by r * (0.6x + 0.09). Hence, the rule x += r * x * x must propagate a new increase of that size to x, via the agenda.</Paragraph> </Section> <Section position="4" start_page="284" end_page="285" type="sub_section"> <SectionTitle> 4.4 Convergence </SectionTitle> <Paragraph position="0"> Whether the agenda algorithm halts depends on the Dyna program and the input. 
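For ground (variable-free) rules, the agenda loop of Fig. 3 in §4.2 can be sketched in Python as below. This is an illustrative sketch under stated assumptions rather than the compiled C++ that Dyna generates: rules are assumed to be already instantiated and given as (consequent, antecedent list) pairs, side conditions are omitted, items are popped in arbitrary order rather than by priority, and the names agenda_solve and max_pops are hypothetical. The max_pops cap is an added safeguard because, as this section discusses, the loop need not halt.

```python
import operator


def agenda_solve(axioms, rules, plus, times, zero, max_pops=10000):
    """Sketch of Fig. 3 for ground rules given as (c, [a1, ..., ak])."""
    agenda = dict(axioms)                     # line 1
    chart = {}
    pops = 0
    while agenda and max_pops > pops:         # line 2 (with an added cap)
        a, delta = agenda.popitem()           # lines 4-5: choose a, take its update
        pops += 1
        old = chart.get(a, zero)              # line 6
        new = plus(old, delta)
        chart[a] = new
        if new == old:                        # line 7: only propagate real changes
            continue
        for c, ants in rules:                 # line 9
            for i, ai in enumerate(ants):     # lines 10-11
                if ai != a:
                    continue
                term = None
                for j, aj in enumerate(ants):  # line 12: product over antecedents
                    if j == i:
                        factor = delta        # the antecedent being updated
                    elif i > j and aj == a:
                        factor = old          # earlier copy of a: use the old value
                    else:
                        factor = chart.get(aj, zero)
                    term = factor if term is None else times(term, factor)
                agenda[c] = plus(agenda.get(c, zero), term)
    return chart


# Sum-product semiring: d is built from two copies of c, so the old/delta
# bookkeeping of line 12 is what prevents double-counting.
chart = agenda_solve({"a": 2.0, "b": 3.0},
                     [("c", ["a", "b"]), ("d", ["c", "c"])],
                     operator.add, operator.mul, 0.0)

# Min-plus semiring on a two-node cycle: the positive-weight cycle is
# explored once and then the test at line 7 stops propagation.
inf = float("inf")
dist = agenda_solve({"A": 0.0, "eAB": 1.0, "eBA": 1.0},
                    [("B", ["A", "eAB"]), ("A", ["B", "eBA"])],
                    min, operator.add, inf)
```

In the first call the chart ends with c = 6 and d = 36; in the second, the cyclic rules settle at A = 0 and B = 1. A sum-product program with cyclic rules could instead keep producing ever smaller updates, which is the case the convergence criteria discussed in this section are meant to handle.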
Like any other Turing-complete language, Dyna gives you enough freedom to write undesirable programs.</Paragraph> <Paragraph position="1"> Most NLP algorithms do terminate, of course, and this remains true under the agenda algorithm.</Paragraph> <Paragraph position="2"> For typical algorithms, only finitely many different items (theorems) can be derived from a given finite input (set of axioms).[12] This ensures termination if one is doing unweighted deduction with $\langle W,\oplus,\otimes\rangle = \langle \{T,F\},\vee,\wedge\rangle$, since the test at line 7 ensures that no item is processed more than once.[13] The same test ensures termination if one is searching for the best proof or parse with (say) $\langle W,\oplus,\otimes\rangle = \langle \mathbb{R}_{\geq 0},\min,+\rangle$, where values are negated log probabilities. Positive-weight cycles will not affect the min. (Negative-weight cycles, however, would correctly cause the computation to diverge; these do not arise with probabilities.) If one is using $\langle W,\oplus,\otimes\rangle = \langle \mathbb{R}_{\geq 0},+,\times\rangle$ to compute the total weight of all proofs or parses, as in the inside algorithm, then Dyna must solve a system of nonlinear equations. The agenda algorithm does this by iterative approximation (propagating updates around any cycles in the proof graph until numerical convergence), essentially as suggested by Stolcke (1995) for the case of Earley's algorithm.[14] Again, the computation may diverge.</Paragraph> <Paragraph position="3"> Footnote 12: This holds for all Datalog programs, for instance.</Paragraph> <Paragraph position="4"> Footnote 13: This argument does not hold if Dyna is used to express programs outside the semiring. In particular, one can write instances of SAT and other NP-hard constraint satisfaction problems by using cyclic rules with negation over finitely many boolean-valued items (Niemelä, 1998). 
Here the agenda algorithm can end up flipping values forever between false and true; a more general solver would have to be called in order to find a stable model of a SAT problem's equations.</Paragraph> <Paragraph position="5"> Footnote 14: Still assuming the number of items is finite, one could in principle materialize the system of equations and call a dedicated numerical solver. In some special cases only a linear solver is needed: e.g., for unary rule cycles (Stolcke, 1995), or $\epsilon$-cycles in FSMs (Eisner, 2002).</Paragraph> <Paragraph position="6"> One can declare the conditions under which items of a particular type (constit or goal) should be treated as having converged. Then asking for the value of goal will run the agenda algorithm not until the agenda is empty, but only until chart[goal] has converged by this criterion.</Paragraph> </Section> <Section position="5" start_page="285" end_page="285" type="sub_section"> <SectionTitle> 4.5 Prioritization </SectionTitle> <Paragraph position="0"> The order in which items are chosen at line 4 does not affect the soundness of the agenda algorithm, but can greatly affect its speed. We implement the agenda as a priority queue whose priority function may be specified by the user.[15] Charniak et al. (1998) and Caraballo and Charniak (1998) showed that, when seeking the best parse (using min= or max=), best-first parsing can be extremely effective. Klein and Manning (2003a) went on to describe admissible heuristics and an A* framework for parsing. For A* in our general framework, the priority of item a should be an estimate of the value of the best proof of goal that uses a. (This non-standard formulation is carefully chosen.[16]) If so, goal is guaranteed to converge the very first time it is selected from the priority-queue agenda.</Paragraph> <Paragraph position="1"> Prioritizing &quot;good&quot; items first can also be useful in other circumstances. 
The inside-outside training algorithm requires one to find all parses, but finding the high-probability parses first allows one to ignore the rest by &quot;early stopping.&quot; In all these schemes (even A*), processing promising items as soon as possible risks having to reprocess them if their values change later. Thus, this strategy should be balanced against the &quot;topological sort&quot; strategy of waiting to process an item until its value has (probably) converged.[17] Ultimately we hope to learn priority functions that effectively balance these two strategies (especially in the context of early stopping).
Footnote 15: At present by writing a C++ function; ultimately within Dyna, by defining items such as priority(constit(&quot;s&quot;,0,3)).
Footnote 16: It is correct for proofs that incorporate two copies of a's value, or (more important) no copies of a's value because a is a side condition. Thus, it recognizes that a low-probability item must have high priority if it could be used as a side condition in a higher-probability parse (though this cannot happen for the side conditions derived by the magic templates transformation (§6)). Note also that a's own value (Nederhof, 2003) might not be an optimistic estimate, if negative weights are present.
Footnote 17: In parsing, for example, one often processes narrower constituents before wider ones. But such strategies do not always exist, or break down in the presence of unary rule cycles, or cannot be automatically found. Goodman's (1999) strategy of building all items and sorting them before computing any weights is wise only if one genuinely wants to build all items.</Paragraph> </Section> <Section position="6" start_page="285" end_page="286" type="sub_section"> <SectionTitle> 4.6 Matching, indexing, and interning </SectionTitle> <Paragraph position="0"> The crucial work in Fig. 3 occurs in the iteration over instantiated rules at lines 9-11. 
In practice, we restructure this triply nested loop as follows, where each line retains the variable bindings that result from the unification in the previous line:
9. for each antecedent pattern $a_i$ that appears in some program rule r and unifies with a
10.   for each way of simultaneously unifying r's remaining antecedent patterns $a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_k$ with items that may have non-0 value in the chart
11.     construct r's consequent c (* all vars are bound *)
Our implementation of line 9 tests a against all of the antecedent patterns at once, using a tree of simple &quot;if&quot; tests (generated by the Dyna-to-C++ compiler) to share work across patterns. As an example, a = constit(&quot;np&quot;,3,8) will match two antecedents at line 3 of Fig. 1, but will fail to match in line 4. Because a is variable-free (for DPs), a full unification algorithm is not necessary, even though an antecedent pattern can contain repeated variables and nested subterms.</Paragraph> <Paragraph position="1"> Line 10 rapidly looks up the rule's other antecedents using indices that are automatically maintained on the chart. For example, once constit(&quot;np&quot;,4,8) has matched antecedent 2 of line 3 of Fig. 1, the compiled code consults a maintained list of the chart constituents that start at position 8 (i.e., items of the form constit(Z,8,K) that have already been derived). Suppose one of these is constit(&quot;vp&quot;,8,15): then the code finds the rule's remaining antecedent by consulting a list of items of the form rewrite(X,&quot;np&quot;,&quot;vp&quot;). That leads it to construct consequents such as constit(&quot;s&quot;,4,15) at line 11. By default, equal terms are represented by equal pointers. 
While this means terms must be &quot;interned&quot; when constructed (requiring hash lookup), it enforces structure-sharing and allows any term to be rapidly copied, hashed, or equality-tested without dereferencing the pointer.[18] Each of the above paragraphs conceals many decisions that affect runtime. This presents future opportunities for feedback-directed optimization, where profiled runs on typical data influence the compiler.</Paragraph> <Paragraph position="2"> Footnote 18: The compiled code provides garbage collection on the terms; this is important when running over large datasets.</Paragraph> </Section> </Section> <Section position="6" start_page="286" end_page="287" type="metho"> <SectionTitle> 5 Computing Gradients </SectionTitle> <Paragraph position="0"> The value of goal is a function of the axioms' values.</Paragraph> <Paragraph position="1"> If the function is differentiable, we may want to get its gradient with respect to its parameters (the axiom values), to aid in numerically optimizing it.</Paragraph> <Section position="1" start_page="286" end_page="286" type="sub_section"> <SectionTitle> 5.1 Gradients by symbolic differentiation </SectionTitle> <Paragraph position="0"> The gradient computation can be derived from the original by a program transformation. For each item a in the original program (in particular, for each axiom) the new program will also compute a new item g(a), whose value is $\partial$goal/$\partial a$.</Paragraph> <Paragraph position="1"> Thus, given weighted axioms, the new program computes both goal and $\nabla$goal. An optimization algorithm such as conjugate gradient can use this information to tune the axiom weights to maximize goal. An alternative is the EM algorithm (Dempster et al., 1977) for probabilistic generative models such as PCFGs. Luckily the same program serves, since for such models, the E count (expected count) of an item a can be found as $a \cdot g(a)$/goal. 
In other words, the inside-outside algorithm has the same structure as computing the function and its gradient.</Paragraph> <Paragraph position="2"> The GRADIENT transformation is simple. For example,[19] given a rule $c$ += $a_1 * a_2 * \cdots * a_{k'}$ whenever $?b_{k'+1}$ &amp; $\cdots$ &amp; $?b_k$, we add a new rule $g(a_i)$ += $g(c) * a_1 * \cdots * a_{i-1} * a_{i+1} * \cdots * a_{k'}$ whenever $?a_i$, for each $i = 1, 2, \ldots, k'$. (The original rule remains, since we need inside values to compute outside values.) This strategy for computing the gradient $\partial$goal/$\partial a$ via the chain rule is an example of automatic differentiation in the reverse mode (Griewank and Corliss, 1991), known in the neural network community as back-propagation.</Paragraph> <Paragraph position="4"> Footnote 19 (beginning lost): ... $\sum_c g(c) * \partial c/\partial a_i$ by the chain rule.</Paragraph> </Section> <Section position="2" start_page="286" end_page="287" type="sub_section"> <SectionTitle> 5.2 Gradients by back-propagation </SectionTitle> <Paragraph position="0"> However, what if goal might be computed only approximately, by early stopping before convergence (§4.5)? To avoid confusing the optimizer, we want the exact gradient of the approximate function.</Paragraph> <Paragraph position="1"> To do this, we &quot;unwind&quot; the computation of goal, undoing the value updates while building up the gradient values. The idea is to differentiate an &quot;unrolled&quot; version of the original computation (Williams and Zipser, 1989), in which an item at time t is considered to be a different variable (possibly with different value) than the same item at time t + 1. The reverse pass must recover earlier values.</Paragraph> <Paragraph position="3"> Figure 4 (algorithm):
1. for each a, gchart[a] := 0 and gagenda[a] := 0 (* respectively hold $\partial$goal/$\partial$chart[a] and $\partial$goal/$\partial$agenda[a] *)
2. gchart[goal] := 1
3. for each $\langle a, \Delta, \text{old}\rangle$ triple that was considered at line 8 of Fig. 3, but in the reverse order (* $\Delta$ is agenda[a] *)
4.   G := gchart[a] (* will accumulate gagenda[a] here *)
5.   for each inference rule &quot;$c$ += $a_1 * a_2 * \cdots * a_k$&quot;
6.     for i from 1 to k
7.       for each way of instantiating the rule's variables such that $a_i = a$
8.         for h from 1 to k such that $a_h$ is not a side cond.
9.           (beginning of this line lost) ... and $a_j = a$; $\Delta$ if $j \neq h$ and $j = i$; chart[$a_j$] otherwise
10.          if $h \neq i$ then gchart[$a_h$] += g
11.          if $h \geq i$ and $a_h = a$ then G += g
12.   gagenda[a] := G
13.   chart[a] := old
14. return gagenda[a] for each axiom a</Paragraph> <Paragraph position="4"> Figure 4 (caption, beginning lost): ... the case $\langle W,\oplus,\otimes\rangle = \langle \mathbb{R},+,\times\rangle$. The proof is suppressed for lack of space.</Paragraph> <Paragraph position="6"> Our somewhat tricky algorithm is shown in Fig. 4. At line 3, a stack is needed to remember the sequence of $\langle a, \text{old}, \Delta\rangle$ triples from the original computation.[20] It is a more efficient version of the &quot;tape&quot; usually used in automatic differentiation. For example, it uses $O(n^2)$ rather than $O(n^3)$ space for the CKY algorithm. The trick is that Fig. 3 does not record all its computations, but only its sequence of items. Fig. 4 then re-runs the inference rules to reconstruct the computations in an acceptable order. This method is a generalization of Eisner's (2001) prioritized forward-backward algorithm for infinite-state machines. As Eisner (2001) pointed out, the tape created on the first forward pass can also be used to speed up later passes (i.e., after the numerical optimizer has adjusted the axiom weights).[21]</Paragraph> <Paragraph position="7"> Footnote 20: If one is willing to risk floating-point error, then one can store only $\langle a, \text{old}\rangle$ on the stack and recover $\Delta$ as chart[a] $-$ old. Also, agenda[a] and gagenda[a] can be stored in the same location, as they are only used during the forward and the backward pass, respectively.</Paragraph> <Paragraph position="8"> Footnote 21: In brief, a later forward pass that chooses a at Fig. 
3, line 4 according to the recorded tape order (1) is faster than using a priority queue, (2) avoids ordering-related discontinuities in the objective function as the axiom weights change, (3) can prune by skipping useless updates a that scarcely affected goal (e.g.,</Paragraph> </Section> <Section position="3" start_page="287" end_page="287" type="sub_section"> <SectionTitle> 5.3 Parameter estimation </SectionTitle> <Paragraph position="0"> To support parameter training using these gradients, our implementation of Dyna includes a training module, DynaMITE. DynaMITE supports the EM algorithm (and many variants), supervised and unsupervised training of log-linear (&quot;maximum entropy&quot;) models using quasi-Newton methods, and smoothing-parameter tuning on development data.</Paragraph> <Paragraph position="1"> As an object-oriented C++ library, it also facilitates rapid implementation of new estimation techniques (Smith and Eisner, 2004; Smith and Eisner, 2005).</Paragraph> </Section> </Section> <Section position="7" start_page="287" end_page="289" type="metho"> <SectionTitle> 6 Program Transformations </SectionTitle> <Paragraph position="0"> Another interest of Dyna is that its high-level specifications can be manipulated by mechanical source-to-source program transformations. This makes it possible to derive new algorithms from old ones.</Paragraph> <Paragraph position="1"> §5.1 already sketched the gradient transformation for finding $\nabla$goal. We note a few other examples.</Paragraph> <Paragraph position="2"> Bounding transformations generate a new program that computes upper or lower bounds on goal, via generic bounding techniques (Prieditis, 1993; Culberson and Schaeffer, 1998). The A* heuristics explored by Klein and Manning (2003a) can be seen as resulting from bounding transformations.</Paragraph> <Paragraph position="3"> With John Blatz, we are also exploring transformations that can result in asymptotically more efficient computations of goal. 
Their unweighted versions are well known in the logic programming community (Tamaki and Sato, 1984; Ramakrishnan, 1991). Folding introduces new intermediate items, perhaps exploiting the distributive law; applications include parsing speedups such as that of Eisner and Satta (1999), as well as well-known techniques for speeding up multi-way database joins, constraint programming, or marginalization of graphical models. Unfolding eliminates items; it can be used to specialize a parser to a particular grammar and then to eliminate unary rules. Magic templates introduce top-down filtering into the search strategy and can be used to derive Earley's algorithm (Minnen, 1996), to introduce left-corner filters, and to restrict FSM constructions to build only accessible states.</Paragraph> <Paragraph position="4"> constituents not in any good parse) by consulting gagenda[a] values that the previous backward pass can have written onto the tape (overwriting Δ or old).</Paragraph> <Paragraph position="5"> Finally, there are low-level optimizations. Term transformations restructure terms to change their layout in memory. We are also exploring the introduction of declarations that control which items use the agenda or are memoized in the chart. 
This can be used to support lazy or &quot;on-the-fly&quot; computation (Mohri et al., 1998) and asymptotic space-saving tricks (Binder et al., 1997).</Paragraph> <Section position="1" start_page="287" end_page="287" type="sub_section"> <SectionTitle> 7 Usefulness of the Implementation 7.1 Applications </SectionTitle> <Paragraph position="0"> The current Dyna compiler has proved indispensable in our own recent projects, in the sense that we would not have attempted many of them without it.</Paragraph> <Paragraph position="1"> In some cases, we were experimenting with genuinely new algorithms not supported by any existing tool, as in our work on dependency-length-limited parsing (Eisner and Smith, 2005b) and loosely syntax-based machine translation (Eisner and D. Smith, 2005). (Dyna would have been equally helpful in the first author's earlier work on new algorithms for lexicalized and CCG parsing, syntactic MT, transformational syntax, trainable parameterized FSMs, and finite-state phonology.) In other cases (Smith and Eisner, 2004; Smith and Smith, 2004; Smith et al., 2005), Dyna let us quickly replicate, tweak, and combine useful techniques from the literature. These techniques included unweighted FS morphology, conditional random fields (Lafferty et al., 2001), synchronous parsers (Wu, 1997; Melamed, 2003), lexicalized parsers (Eisner and Satta, 1999),22 partially supervised training à la Pereira and Schabes (1992),23 and grammar induction (Klein and Manning, 2002). These replications were easy to write and extend, and to train via §5.2.</Paragraph> </Section> <Section position="2" start_page="287" end_page="289" type="sub_section"> <SectionTitle> 7.2 Experiments </SectionTitle> <Paragraph position="0"> We compared the current Dyna compiler to hand-built systems on a variety of parsing tasks. 
These problems were chosen not for their novelty or interesting structure, but for the availability of existing well-tuned implementations.</Paragraph> <Paragraph position="1"> Best parse. We compared a Dyna CFG parser to the Java parser of Klein and Manning (2003b),24 on the same grammar. Fig. 5 shows the results. Dyna's disadvantage is greater on longer sentences, probably because its greater memory consumption results in worse cache behavior.25 We also compared a Dyna CKY parser to our own hand-built implementation, C++PARSE.</Paragraph> <Paragraph position="2"> C++PARSE is designed like the Dyna parser but includes a few storage and indexing optimizations that Dyna does not yet have. Fig. 6 shows the 5-fold speedup from these optimizations on binarized-Treebank parsing with a large 119K-rule grammar. The sharp diagonal indicates that C++PARSE is simply a better-tuned version of the Dyna parser.</Paragraph> <Paragraph position="3"> These optimizations and others are now being incorporated into the Dyna compiler, and are expected to provide similar speedups, putting Dyna's parser in the ballpark of the Klein and Manning parser. Importantly, these improvements will speed up existing Dyna programs through recompilation.</Paragraph> <Paragraph position="4">
                         99%           99.99%
uniform                  89.3 (4.5)    90.3 (4.6)
after 1 EM iteration     82.9 (6.8)    85.2 (6.9)
after 2 EM iterations    77.1 (8.4)    79.1 (8.3)
after 3 EM iterations    71.6 (9.4)    73.7 (9.5)
after 4 EM iterations    66.8 (10.0)   68.8 (10.2)
after 5 EM iterations    62.9 (10.3)   65.0 (10.5)
ent stage of training; later PCFGs are sharper. The table shows the percentage of agenda runtime (mean across 1409 sentences, with standard deviation in parentheses) required to get within 99% or 99.99% of the true value of goal.</Paragraph> <Paragraph position="5"> 25 Unlike Java, Dyna does not yet decide automatically when to perform garbage collection. In our experiment, garbage collection was called explicitly after each sentence and counted as part of the runtime (typically 0.25 seconds for 10-word sentences, 5 seconds for 40-word sentences).</Paragraph> <Paragraph position="6"> Inside parsing. Johnson (2000) provides a C implementation of the inside-outside algorithm for EM training of PCFGs. We ran five iterations of EM on the WSJ10 corpus26 using the Treebank grammar from that corpus. Dyna took 4.1 times longer.</Paragraph> <Paragraph position="7"> Early stopping. An advantage of the weighted agenda discipline (§4.2) is that, with a reasonable priority function such as an item's inside probability, the inside algorithm can be stopped early with an estimate of goal's value. To measure the quality of this early estimate, we tracked the progression of goal's value as each sentence was being parsed. In most instances, and especially after more EM iterations, the estimate was very tight long before all the weight had been accumulated (Table 1). This suggests that early stopping is a useful training speedup. PRISM. The implemented tool most similar to Dyna that we have found is PRISM (Zhou and Sato, 2003), a probabilistic Prolog with efficient tabling and compilation. PRISM inherits expressive power from Prolog but handles only probabilities, not general semirings (or even side conditions).27 In CKY parsing tests, PRISM was able to handle only a small fraction of the Penn Treebank ruleset (2,400 high-probability rules) and tended to crash on long sentences. 
Dyna is designed for real-world use: it consistently parses over 10 times faster than PRISM and scales to full-sized problems.</Paragraph> <Paragraph position="8"> IBAL (Pfeffer, 2001) is an elegant and powerful language for probabilistic modeling; it generalizes Bayesian networks in interesting ways.28 Since PCFGs and marginalization can be succinctly expressed in IBAL, we attempted a performance comparison on the task of the inside algorithm (Fig. 1). Unfortunately, IBAL's algorithm appears not to terminate if the PCFG contains any kind of recursion reachable from the start symbol.</Paragraph> <Paragraph position="9"> 26 Sentences with ≤10 words, stripping punctuation. 27 Thus it can handle a subset of the cases described by Goodman (1999), again by building the whole parse forest. 28 It might be possible to implement IBAL in Dyna (Pfeffer,</Paragraph> </Section> </Section> </Paper>