<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3032">
  <Title>Dyna: A Declarative Language for Implementing Dynamic Programs</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A Basic Example: PCFG Parsing
</SectionTitle>
    <Paragraph position="0"> We believe Dyna is a flexible and intuitive specification language for dynamic programs. Such a program specifies how to combine partial solutions until a complete solution is reached.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Inside Algorithm, in Dyna
</SectionTitle>
      <Paragraph position="0"> Fig. 1 shows a simple Dyna program that corresponds to the inside algorithm for PCFGs (i.e., the probabilistic generalization of CKY parsing). It may be regarded as a system of equations over an arbitrary number of unknowns, which have structured names such as constit(s,0,3). These unknowns are called items. They resemble variables in a C program, but we instead use the term "variable" to refer to the capitalized identifiers X, I, K, ... in lines 2-4.¹ At runtime, a user must provide an input sentence and grammar by asserting values for certain items. If the input is John loves Mary, the user should assert values of 1 for word(John,0,1), word(loves,1,2), word(Mary,2,3), and end(3). If the PCFG contains a rewrite rule np → Mary with probability p(Mary | np) = 0.003, the user should assert that rewrite(np,Mary) has value 0.003.</Paragraph>
      <Paragraph position="1"> Given these base cases, the equations in Fig. 1 enable Dyna to deduce values for other items. The deduced value of constit(s,0,3) will be the inside probability $\beta_s(0,3)$,² and the deduced value of goal will be the total probability of all parses of the input.</Paragraph>
      <Paragraph position="2"> Lines 2-4 are equational schemas that specify how to compute the value of items such as constit(s,0,3) from the values of other items. By using the summation operator +=, lines 2-3 jointly say that for any X, I, and K, constit(X,I,K) is defined by summation over the remaining variables, as $\sum_{W}$ rewrite(X,W)*word(W,I,K) + $\sum_{Y,Z,J}$ rewrite(X,Y,Z)*constit(Y,I,J)*constit(Z,J,K). For example, constit(s,0,3) is a sum of quantities such as rewrite(s,np,vp)*constit(np,0,1)*constit(vp,1,3).</Paragraph>
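      <Paragraph> The equations above can also be evaluated by hand-written code, which makes the summations concrete. The following is a minimal Python sketch (not Dyna, and not the authors' implementation) that computes the same sums in order of increasing span width. It assumes a grammar with only the two rule shapes used in Fig. 1; the function name inside and every grammar weight except rewrite(np,Mary) = 0.003 are illustrative assumptions.

```python
from collections import defaultdict

def inside(words, unary, binary, root="s"):
    """Evaluate the Fig. 1 equations bottom-up, by increasing span width."""
    n = len(words)
    constit = defaultdict(float)  # (X, I, K) -> value; absent items count as 0
    # Line 2: constit(X,I,K) += rewrite(X,W) * word(W,I,K), with word values of 1.
    for i, w in enumerate(words):
        for (x, w2), p in unary.items():
            if w2 == w:
                constit[x, i, i + 1] += p
    # Line 3: constit(X,I,K) += rewrite(X,Y,Z) * constit(Y,I,J) * constit(Z,J,K).
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (x, y, z), p in binary.items():
                    constit[x, i, k] += p * constit[y, i, j] * constit[z, j, k]
    # Line 4: goal += constit(s,0,N) * end(N), with end(n) asserted as 1.
    goal = constit[root, 0, n]
    return constit, goal

# Hypothetical toy grammar; only the weight of rewrite(np,Mary) comes from the text.
unary = {("np", "John"): 0.002, ("verb", "loves"): 0.01, ("np", "Mary"): 0.003}
binary = {("s", "np", "vp"): 1.0, ("vp", "verb", "np"): 1.0}
constit, goal = inside(["John", "loves", "Mary"], unary, binary)
```

With this toy grammar the sentence has a single parse, so goal equals the product of the five rule weights used in that parse.</Paragraph>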
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Execution Model
</SectionTitle>
      <Paragraph position="0"> Dyna's declarative semantics state only that it will find values such that all the equations hold.³ Our implementation's default strategy is to propagate updates from an equation's right-hand to its left-hand side, until the system converges. Thus, by default, Fig. 1 yields a bottom-up or data-driven parser.</Paragraph>
      <Paragraph position="1"> ¹ Much of our terminology (item, chart, agenda) is inherited from the parsing literature. Other terminology (variable, term, inference rule, antecedent/consequent, assert/retract, chaining) comes from logic programming. Dyna's syntax borrows from both Prolog and C.</Paragraph>
      <Paragraph position="2"> ² That is, the probability that s would stochastically rewrite to the first three words of the input. If this can happen in more than one way, the probability sums over multiple derivations.</Paragraph>
      <Paragraph position="3"> ³ Thus, future versions of the compiler are free to mix any efficient strategies, even calling numerical equation solvers.</Paragraph>
      <Paragraph position="4"> Fig. 1:
1. :- valtype(term, real).  % declares that all item values are real numbers
2. constit(X,I,K) += rewrite(X,W) * word(W,I,K).  % a constituent is either a word ...
3. constit(X,I,K) += rewrite(X,Y,Z) * constit(Y,I,J) * constit(Z,J,K).  % ... or a combination of two adjacent subconstituents
4. goal += constit(s,0,N) * end(N).  % a parse is any s constituent that covers the input string

Dyna may be seen as a new kind of tabled logic programming language in which theorems are not just proved, but carry values. This suggests some terminology. Lines 2-4 of Fig. 1 are called inference rules. The items on the right-hand side are antecedents, and the item on the left-hand side is their consequent. Assertions can be regarded as axioms. And the default strategy (unlike Prolog's) is forward chaining from the axioms, as in some theorem provers.</Paragraph>
      <Paragraph position="5"> Suppose constit(verb,1,2) increases by Δ. Then the program in Fig. 1 must find all the instantiated rules that have constit(verb,1,2) as an antecedent, and must update their consequents. For example, since line 3 can be instantiated as constit(vp,1,3) += rewrite(vp,verb,np)*constit(verb,1,2)*constit(np,2,3), constit(vp,1,3) must be increased by rewrite(vp,verb,np) * Δ * constit(np,2,3).</Paragraph>
      <Paragraph position="7"> Line 3 actually requires infinitely many such updates, corresponding to all rule instantiations of the form constit(X,1,K) += rewrite(X,verb,Z)*constit(verb,1,2)*constit(Z,2,K).⁴ However, most of these updates would have no effect. We only need to consider the finitely many instantiations where rewrite(X,verb,Z) and constit(Z,2,K) have nonzero values (because they have been asserted or updated in the past).</Paragraph>
      <Paragraph position="8"> The compiled Dyna program rapidly computes this set of needed updates and adds them to a worklist of pending updates, the agenda. Updates from the agenda are processed in some prioritized order (which can strongly affect the speed of the program). When an update is carried out (e.g., constit(vp,1,3) is increased), any further updates that it triggers (e.g., to constit(s,0,3)) are placed back on the agenda in the same way. Multiple updates to the same item are consolidated on the agenda. This cascading update process begins with axiom assertions, which are treated like other updates.</Paragraph>
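      <Paragraph> The cascading update process can be illustrated with a small agenda loop. The following Python sketch is an assumption-laden illustration, not the compiled code Dyna generates: it treats the grammar as fixed parameters rather than as items, uses a first-in-first-out agenda, and scans the chart instead of maintaining indices. The name forward_chain, the tuple encoding of items, and all weights except rewrite(np,Mary) are hypothetical.

```python
from collections import defaultdict

def forward_chain(axioms, unary, binary, root="s"):
    """Propagate updates from antecedents to consequents until the agenda is empty."""
    chart = defaultdict(float)   # item -> value accumulated so far
    agenda = dict(axioms)        # item -> pending increment (dicts keep insertion order)

    def push(item, delta):
        # Multiple updates to the same item are consolidated on the agenda.
        if delta != 0.0:
            agenda[item] = agenda.get(item, 0.0) + delta

    while agenda:
        item = next(iter(agenda))        # FIFO discipline, chosen only for simplicity
        delta = agenda.pop(item)
        chart[item] += delta
        kind = item[0]
        if kind == "word":               # line 2 of Fig. 1
            _, w, i, k = item
            for (x, w2), p in unary.items():
                if w2 == w:
                    push(("constit", x, i, k), p * delta)
        elif kind == "constit":
            _, a, i, k = item
            built = [(it, v) for it, v in chart.items() if it[0] == "constit"]
            for (x, y, z), p in binary.items():   # line 3 of Fig. 1
                for (_, b, j, m), v in built:
                    if y == a and z == b and j == k:   # updated item is the left child
                        push(("constit", x, i, m), p * delta * v)
                    if z == a and y == b and m == i:   # updated item is the right child
                        push(("constit", x, j, k), p * v * delta)
            if a == root and i == 0:              # line 4 of Fig. 1
                push(("goal",), delta * chart[("end", k)])
        elif kind == "end":
            _, n = item
            push(("goal",), chart[("constit", root, 0, n)] * delta)
    return chart

# Axiom assertions are treated like any other update.
axioms = {("word", "John", 0, 1): 1.0, ("word", "loves", 1, 2): 1.0,
          ("word", "Mary", 2, 3): 1.0, ("end", 3): 1.0}
unary = {("np", "John"): 0.002, ("verb", "loves"): 0.01, ("np", "Mary"): 0.003}
binary = {("s", "np", "vp"): 1.0, ("vp", "verb", "np"): 1.0}
chart = forward_chain(axioms, unary, binary)
goal = chart[("goal",)]
```

Asserting an extra increment for an item such as ("constit", "verb", 1, 2) in axioms plays the role of the increase described above: it is merged with the pending update for that item and then propagates to constit(vp,1,3), constit(s,0,3), and goal.</Paragraph>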
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Closely Related Algorithms
</SectionTitle>
      <Paragraph position="0"> We now give some examples of variant algorithms.</Paragraph>
      <Paragraph position="1"> Fig. 1 provides lattice parsing for free. Instead of being integer positions in a string, I, J and K can be symbols denoting states in a finite-state automaton. The code does not have to change, only the input. Axioms should now correspond to weighted lattice arcs, e.g., word(loves,q,r) with value p(portion of speech signal | loves).</Paragraph>
      <Paragraph position="2"> To find the probability of the best parse instead of the total probability of all parses, simply change the value type: replace real with viterbi in line 1. If a and b are viterbi values, a+b is implemented as max(a,b); viterbi values represent log probabilities (for speed and dynamic range). Similarly, replacing real with boolean yields an unweighted parser, in which a constituent is either derived (true value) or not (false value). Then a*b is implemented as a ∧ b, and a+b as a ∨ b.</Paragraph>
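      <Paragraph> The effect of changing the value type can be sketched by making the two operations explicit parameters. The following Python sketch is illustrative only: the Semiring tuple, the parse function, and the toy grammar are assumptions, and the viterbi type here multiplies plain probabilities rather than adding log probabilities, purely to keep the numbers easy to read.

```python
import operator
from collections import namedtuple

# plus and times stand in for the "+" and "*" of Fig. 1; lift maps a weight into the type.
Semiring = namedtuple("Semiring", "plus times zero lift")

REAL = Semiring(operator.add, operator.mul, 0.0, float)          # total probability
VITERBI = Semiring(max, operator.mul, 0.0, float)                # best-parse probability
BOOLEAN = Semiring(operator.or_, operator.and_, False, bool)     # derivable or not

def parse(words, unary, binary, sr, root="s"):
    """Run the Fig. 1 equations with the operations of the given value type."""
    n = len(words)
    constit = {}

    def get(key):
        return constit.get(key, sr.zero)

    def add(key, value):
        constit[key] = sr.plus(get(key), value)

    for i, w in enumerate(words):
        for (x, w2), p in unary.items():
            if w2 == w:
                add((x, i, i + 1), sr.lift(p))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (x, y, z), p in binary.items():
                    value = sr.times(sr.times(sr.lift(p), get((y, i, j))), get((z, j, k)))
                    add((x, i, k), value)
    return get((root, 0, n))

# Hypothetical ambiguous grammar: "a a a" has two parses under x -> x x | a.
words = ["a", "a", "a"]
unary = {("x", "a"): 0.5}
binary = {("x", "x", "x"): 0.5}
total = parse(words, unary, binary, REAL, root="x")
best = parse(words, unary, binary, VITERBI, root="x")
derivable = parse(words, unary, binary, BOOLEAN, root="x")
```

Only the value type changes between the three calls, which mirrors the one-word edit to line 1 described above.</Paragraph>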
      <Paragraph position="3"> The Dyna programmer can declare the agenda discipline--i.e., the order in which updates are processed--to obtain variant algorithms. Although Dyna supports stack and queue (LIFO and FIFO) disciplines, its default is to use a priority queue prioritized by the size of the update. When parsing with real values, this quickly accumulates a good approximation of the inside probabilities, which permits heuristic early stopping before the agenda is empty. With viterbi values, it amounts to uniform-cost search for the best parse, and an item's value is guaranteed not to change once it is nonzero. Dyna will soon allow user-defined priority functions (themselves dynamic programs), which can greatly speed up parsing (Caraballo and Charniak, 1998; Klein and Manning, 2003).</Paragraph>
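      <Paragraph> The uniform-cost behavior with viterbi values can be sketched with a priority queue. This Python sketch is an illustration under stated assumptions, not Dyna's agenda implementation: it assumes every rule weight is at most 1, so that the largest pending value can never be beaten later, and it uses plain probabilities. The function name best_parse_prob and the toy grammar are hypothetical.

```python
import heapq

def best_parse_prob(words, unary, binary, root="s"):
    """Pop the largest pending value first; a popped item's value is final."""
    n = len(words)
    chart = {}   # finalized items: (X, I, K) -> best probability
    heap = []    # entries are (-value, item) because heapq is a min-heap
    for i, w in enumerate(words):
        for (x, w2), p in unary.items():
            if w2 == w:
                heapq.heappush(heap, (-p, (x, i, i + 1)))
    while heap:
        neg, item = heapq.heappop(heap)
        if item in chart:
            continue                     # already finalized with a value at least as good
        value = -neg
        chart[item] = value
        if item == (root, 0, n):
            return value                 # early stopping: the agenda need not be empty
        a, i, k = item
        for (b, j, m), v in list(chart.items()):
            for (x, y, z), p in binary.items():
                if (y, z) == (a, b) and j == k:    # popped item is the left child
                    heapq.heappush(heap, (-(p * value * v), (x, i, m)))
                if (y, z) == (b, a) and m == i:    # popped item is the right child
                    heapq.heappush(heap, (-(p * v * value), (x, j, k)))
    return 0.0

# Hypothetical ambiguous grammar: "a a a" has two parses under x -> x x | a.
best = best_parse_prob(["a", "a", "a"], {("x", "a"): 0.5}, {("x", "x", "x"): 0.5}, root="x")
```

Swapping the heap for a stack or a queue would give the LIFO and FIFO disciplines, at the cost of losing the guarantee that a popped value is final.</Paragraph>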
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Parameter Training
</SectionTitle>
      <Paragraph position="0"> Dyna provides facilities for training parameters. For example, from Fig. 1, it automatically derives the inside-outside (EM) algorithm for training PCFGs.</Paragraph>
      <Paragraph position="1"> How is this possible? Once the program of Fig. 1 has run, goal's value is the probability of the input sentence under the grammar. This is a continuous function of the axiom values, which correspond to PCFG parameters (e.g., the weight of rewrite(np,Mary)). The function could be written out explicitly as a sum of products of sums of products of . . . of axiom values, with the details depending on the sentence and grammar.</Paragraph>
      <Paragraph position="2"> Thus, Dyna can be regarded as computing a function $F(\vec{\theta})$, where $\vec{\theta}$ is a vector of axiom values and $F(\vec{\theta})$ is an objective function such as the probability of one's training data. In learning, one wishes to repeatedly adjust $\vec{\theta}$ so as to increase $F(\vec{\theta})$.</Paragraph>
      <Paragraph position="3"> Dyna can be told to evaluate the gradient of the function with respect to the current parameters $\vec{\theta}$: e.g., if rewrite(vp,verb,np) were increased by $\epsilon$, what would happen to goal? Then any gradient-based optimization method can be applied, using Dyna to evaluate both $F(\vec{\theta})$ and its gradient vector. Also, EM can be applied where appropriate, since it can be shown that EM's E counts can be derived from the gradient. Dyna's strategy for computing the gradient is automatic differentiation in the reverse mode (Griewank and Corliss, 1991), known in the neural network community as back-propagation.</Paragraph>
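      <Paragraph> Reverse-mode differentiation of the inside computation can be sketched by recording each product during the forward pass and replaying the record backwards. The following Python sketch is an illustration, not Dyna's mechanism: the function name inside_with_gradient, the tape layout, and the toy weights (other than rewrite(np,Mary)) are assumptions. The last line uses the standard identity that a rule's expected count equals its weight times the gradient divided by goal.

```python
from collections import defaultdict

def inside_with_gradient(words, unary, binary, root="s"):
    """Return goal and d(goal)/d(weight) for every rule, via a forward and a backward pass."""
    n = len(words)
    value = defaultdict(float)
    tape = []   # one entry per product: (consequent, rule, antecedent items)
    for i, w in enumerate(words):
        for (x, w2), p in unary.items():
            if w2 == w:
                value[x, i, i + 1] += p
                tape.append(((x, i, i + 1), (x, w2), ()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (x, y, z), p in binary.items():
                    value[x, i, k] += p * value[y, i, j] * value[z, j, k]
                    tape.append(((x, i, k), (x, y, z), ((y, i, j), (z, j, k))))
    goal = value[root, 0, n]

    weights = {**unary, **binary}
    adjoint = defaultdict(float)   # d(goal)/d(item value)
    grad = defaultdict(float)      # d(goal)/d(rule weight)
    adjoint[root, 0, n] = 1.0
    # Wider spans were recorded later, so the reversed tape visits consequents first.
    for head, rule, kids in reversed(tape):
        g = adjoint[head]
        if g == 0.0:
            continue
        product = 1.0
        for kid in kids:
            product *= value[kid]
        grad[rule] += g * product
        for pos, kid in enumerate(kids):
            others = 1.0
            for other_pos, other in enumerate(kids):
                if other_pos != pos:
                    others *= value[other]
            adjoint[kid] += g * weights[rule] * others
    return goal, grad

unary = {("np", "John"): 0.002, ("verb", "loves"): 0.01, ("np", "Mary"): 0.003}
binary = {("s", "np", "vp"): 1.0, ("vp", "verb", "np"): 1.0}
goal, grad = inside_with_gradient(["John", "loves", "Mary"], unary, binary)
rule = ("vp", "verb", "np")
expected_count = binary[rule] * grad[rule] / goal   # an E count recovered from the gradient
```

Because this toy sentence has a single parse, each rule in that parse has an expected count of 1, and increasing rewrite(vp,verb,np) by a small amount changes goal by that amount times grad[rule].</Paragraph>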
      <Paragraph position="4"> Dyna comes with a constrained optimization module, DynaMITE,⁶ that can locally optimize $F(\vec{\theta})$. At present, DynaMITE provides the conjugate gradient and variable metric methods, using the Toolkit for Advanced Optimization (Benson et al., 2000) together with a softmax technique to enforce sum-to-one constraints. It supports maximum-entropy training and the EM algorithm.⁷ DynaMITE provides an object-oriented API that allows independent variation of such diverse elements of training as the model parameterization, optimization algorithm, smoothing techniques, priors, and datasets.</Paragraph>
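      <Paragraph> One common way to enforce sum-to-one constraints with a softmax is to optimize unconstrained scores and normalize them within each group. The following Python sketch shows that idea for rewrite rules grouped by left-hand side; it is an assumption about the general technique, not a description of DynaMITE's API, and the names rule_probs and scores are hypothetical.

```python
import math
from collections import defaultdict

def rule_probs(scores):
    """Map unconstrained scores to rule probabilities that sum to one per left-hand side."""
    by_lhs = defaultdict(dict)
    for rule, score in scores.items():
        by_lhs[rule[0]][rule] = score
    probs = {}
    for rules in by_lhs.values():
        top = max(rules.values())   # subtracting the max keeps exp() numerically stable
        exps = {r: math.exp(s - top) for r, s in rules.items()}
        total = sum(exps.values())
        for r, e in exps.items():
            probs[r] = e / total
    return probs

# Hypothetical scores: two np rules with equal scores, and a single s rule.
scores = {("np", "John"): 0.0, ("np", "Mary"): 0.0, ("s", "np", "vp"): 2.5}
probs = rule_probs(scores)
```

An optimizer can then move the scores freely, since any score vector maps to a valid set of probabilities.</Paragraph>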
      <Paragraph position="7"> How about supervised or partly supervised training? The role of supervision is to permit some constituents to be built but not others (Pereira and Schabes, 1992).</Paragraph>
      <Paragraph position="8"> Lines 2-3 of Fig. 1 can simply be extended with an additional antecedent permitted(X,I,K), which must be either asserted or derived for constit(X,I,K) to be derived. In &quot;soft&quot; supervision, the permitted axioms may have values between 0 and 1.⁸</Paragraph>
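      <Paragraph> The extra antecedent can be sketched as one more factor in each product. In the following Python sketch, the function name supervised_inside, the dictionary encoding of permitted, and the toy grammar are illustrative assumptions; a span missing from permitted is treated as having value 0.

```python
from collections import defaultdict

def supervised_inside(words, unary, binary, permitted, root="s"):
    """Inside algorithm in which every constituent is multiplied by permitted(X,I,K)."""
    n = len(words)
    constit = defaultdict(float)
    for i, w in enumerate(words):
        for (x, w2), p in unary.items():
            if w2 == w:
                constit[x, i, i + 1] += p * permitted.get((x, i, i + 1), 0.0)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (x, y, z), p in binary.items():
                    constit[x, i, k] += (p * constit[y, i, j] * constit[z, j, k]
                                         * permitted.get((x, i, k), 0.0))
    return constit[root, 0, n]

# Hypothetical ambiguous grammar: "a a a" has two parses under x -> x x | a.
words = ["a", "a", "a"]
unary = {("x", "a"): 0.5}
binary = {("x", "x", "x"): 0.5}
spans = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3)]
everything = {("x", i, k): 1.0 for i, k in spans}
hard = {**everything, ("x", 1, 3): 0.0}   # forbid the right-branching bracket
soft = {**everything, ("x", 1, 3): 0.5}   # merely discount it
total = supervised_inside(words, unary, binary, everything, root="x")
hard_total = supervised_inside(words, unary, binary, hard, root="x")
soft_total = supervised_inside(words, unary, binary, soft, root="x")
```

Hard supervision removes the parse that uses the forbidden span, while soft supervision scales its contribution.</Paragraph>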
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 C++ Interface and Implementation
</SectionTitle>
    <Paragraph position="0"> A Dyna program compiles to a set of portable C++ classes that manage the items and perform inference.</Paragraph>
    <Paragraph position="1"> These classes can be used in a larger C++ application.⁹ This strategy keeps Dyna both small and convenient.</Paragraph>
    <Paragraph position="2"> A C++ chart object supports the computation of item values and gradients. It keeps track of built items, their values, and their derivations, which form a proof forest. It also holds an ordered agenda of pending updates. Some built items may be &quot;transient,&quot; meaning that they are not actually stored in the chart at the moment but will be transparently recomputed upon demand.</Paragraph>
    <Paragraph position="3"> The Dyna compiler generates a hard-coded decision tree that analyzes the structure of each item popped from the agenda to decide which inference rules apply to it.</Paragraph>
    <Paragraph position="4"> To enable fast lookup of the other items that participate in these inference rules, it generates code to maintain appropriate indices on the chart.</Paragraph>
    <Paragraph position="5"> Objects such as constit(vp,1,3) are called terms and may be recursively nested to any depth. (Items are just terms with values.) Dyna has a full first-order type system for terms, including primitive and disjunctive types, and permitting compile-time type inference. These types are compiled into C++ classes that support constructors and accessors, garbage collection, subterm sharing (which may lead to asymptotic speedups, as in CCG parsing (Vijay-Shanker and Weir, 1990)), and interning.¹⁰ Dyna can import new primitive term types and value types from C++, as well as C++ functions that combine values and that let the user define the weights of certain terms.</Paragraph>
    <Paragraph position="6"> In the current implementation, every rule must have the restricted form c += a_1 * a_2 * ... * a_k (where each a_i is an item or side condition and (X,+,*) is a semiring of values). The design for Dyna's next version lifts this restriction to allow arbitrary, type-heterogeneous expressions on the right-hand side of an inference rule.¹¹</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Some Further Applications
</SectionTitle>
    <Paragraph position="0"> Dyna is useful for any problem where partial hypotheses are assembled, or where consistency has to be maintained. It is already being used for parsing, syntax-based machine translation, morphological analysis, grammar induction, and finite-state operations.</Paragraph>
    <Paragraph position="1"> It is well known that various parsing algorithms for CFG and other formalisms can be simply written in terms of inference rules. Fig. 2 renders one such example in Dyna, namely Earley's algorithm. Two features are worth noting: the use of recursively nested subterms such as lists, and the SIDE function, which evaluates to 1 or 0 according to whether its argument has a defined value yet. These side conditions are used here to prevent hypothesizing a constituent until there is a possible left context that calls for it.</Paragraph>
    <Paragraph position="2"> Several recent syntax-directed statistical machine translation models are easy to build in Dyna. The simplest (Wu, 1997) uses constit(np,3,5,np,4,8) to denote an NP spanning positions 3-5 in the English string that is aligned with an NP spanning positions 4-8 in the Chinese string. When training or decoding, the hypotheses of better-trained monolingual parsers can provide either hard or soft partial supervision (section 2.4).</Paragraph>
    <Paragraph position="3"> Dyna can manipulate finite-state transducers. For instance, the weighted arcs of the composed FST M1 ∘ M2 can be deduced from the arcs of M1 and M2. Training M1 ∘ M2 back-propagates to train the original weights in M1 and M2, as in (Eisner, 2002).</Paragraph>
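    <Paragraph> Deducing the arcs of a composed transducer follows the same sum-of-products pattern as parsing. The following Python sketch assumes epsilon-free transducers and a hypothetical arc encoding (source state, input symbol, output symbol, destination state) mapped to a weight; the function name compose and the example arcs are illustrative, not taken from Dyna.

```python
from collections import defaultdict

def compose(arcs1, arcs2):
    """Each composed arc sums w1 * w2 over the shared intermediate symbol."""
    composed = defaultdict(float)
    for (q1, a, b, r1), w1 in arcs1.items():
        for (q2, b2, c, r2), w2 in arcs2.items():
            if b == b2:   # the output of the first machine feeds the input of the second
                composed[(q1, q2), a, c, (r1, r2)] += w1 * w2
    return dict(composed)

# Hypothetical machines: M1 rewrites "a" to "x" or "y"; M2 rewrites both to "z".
m1 = {(0, "a", "x", 1): 0.5, (0, "a", "y", 1): 0.5}
m2 = {(0, "x", "z", 0): 0.25, (0, "y", "z", 0): 0.75}
arcs = compose(m1, m2)
```

Here the single composed arc from state (0, 0) to state (1, 0) receives weight 0.5 * 0.25 + 0.5 * 0.75.</Paragraph>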
  </Section>
</Paper>