<?xml version="1.0" standalone="yes"?>
<Paper uid="J03-1006">
  <Title>Weighted Deductive Parsing and Knuth's Algorithm</Title>
  <Section position="3" start_page="0" end_page="139" type="metho">
    <SectionTitle>
2. Weighted Deductive Parsing
</SectionTitle>
    <Paragraph position="0"> The use of deduction systems for specifying parsers has been proposed by Shieber, Schabes, and Pereira (1995) and Sikkel (1997). As already remarked by Goodman (1999), deduction systems can also be extended to manipulate weights.</Paragraph>
    <Paragraph position="1">  Here we de[?] Faculty of Arts, Humanities Computing, University of Groningen, P.O. Box 716, NL-9700 AS Groningen, The Netherlands. E-mail: markjan@let.rug.nl. Secondary affiliation is the German Research Center for Artificial Intelligence (DFKI).</Paragraph>
    <Paragraph position="2"> 1 Weighted deduction is closely related to probabilistic logic, although the problem considered in this article (viz., finding derivations with lowest weights) is different from typical problems in probabilistic logic. For example, Frisch and Haddawy (1994) propose inference rules that manipulate logical formulas attached to intervals of probabilities, and the objective of deduction is to determine intervals that are as narrow as possible.</Paragraph>
    <Paragraph position="3">  Weighted deduction system for bottom-up parsing.</Paragraph>
    <Paragraph position="4"> fine such a weighted deduction system for parsing as consisting of a finite set of inference rules of the form:</Paragraph>
    <Paragraph position="6"> are items, of which I</Paragraph>
    <Paragraph position="8"> are the antecedents, and c</Paragraph>
    <Paragraph position="10"> is a list of side conditions linking the inference rule to the grammar and the input string.  We assign unique variables</Paragraph>
    <Paragraph position="12"> as arguments, to the consequent. This allows us to assign a weight to each occurrence of an (instantiated) item that we derive by an inference rule, by means of a function on the weights of the (instantiated) antecedents of that rule.</Paragraph>
    <Paragraph position="13"> A weighted deduction system furthermore contains a set of goal items; like the inference rules, this set is parameterized by the grammar and the input. The objective of weighted deductive parsing is to find the derivation of a goal item with the lowest weight. In this article we assume that, for a given grammar and input string, each inference rule can be instantiated in a finite number of ways, which ensures that this problem can be solved under the constraints on the weight functions to be discussed in Sections 4 and 5.</Paragraph>
    <Paragraph position="14"> Our examples will be restricted to context-free parsing and include the deduction system for weighted bottom-up parsing in Figure 1 and that for weighted top-down parsing in Figure 2. The latter is very close to an extension of Earley's algorithm described by Lyon (1974). The side conditions refer to an input string w = a</Paragraph>
    <Paragraph position="16"> to a weighted context-free grammar with a set of productions P, each of which has the  form (y: A - a), where y is a non-negative real-valued weight, A is a nonterminal, 2 Note that we have no need for (explicit) axioms, since we allow inference rules to have zero antecedents.</Paragraph>
    <Paragraph position="17">  Set of goal items is as in Figure 1.</Paragraph>
    <Paragraph position="18"> Figure 3 Alternative weighted deduction system for top-down parsing. and a is a list of zero or more terminals or nonterminals. We assume the weight of a grammar derivation is given by the sum of the weights of the occurrences of productions therein.</Paragraph>
    <Paragraph position="19"> Weights may be atomic entities, as in the deduction systems discussed above, where they are real-valued, but they may also be composed entities. For example, Figure 3 presents an alternative form of weighted top-down parsing using pairs of values, following Stolcke (1995). The first value is the forward weight, that is, the sum of weights of all productions that were encountered in the lowest-weighted derivation in the deduction system of an item [A - a * b, i, j]. The second is the inner weight; that is, it considers the weight only of the current production A - ab plus the weights of productions in lowest-weighted grammar derivations for nonterminals in a. These inner weights are the same values as the weights in Figures 1 and 2. In fact, if we omit the forward weights, we obtain the deduction system in Figure 2.</Paragraph>
    <Paragraph position="20"> Since forward weights pertain to larger parts of grammar derivations than the inner weights, they may be better suited to direct the search for the lowest-weighted complete grammar derivation. We assume a pair (z  . (Tendeau [1997] has shown the general idea can also be applied to left-corner parsing.) In order to link (weighted) deduction systems to literature to be discussed in Section 3, we point out that a deduction system having a grammar G in a certain formalism F and input string w in the side conditions can be seen as a construction c of a context-free grammar c(G, w) out of grammar G and input w. The set of productions of c(G, w) is obtained by instantiating the inference rules in all possible ways using productions from G and input positions pertaining to w. The consequent of such an instantiated inference rule then acts as the left-hand side of a production, and the (possibly empty) list of antecedents acts as its right-hand side. In the case of a weighted deduction system, the productions are associated with weight functions computing the weight of the left-hand side from the weights of the right-hand side nonterminals.</Paragraph>
    <Paragraph position="21"> For example, if the input is w = a  , which states that if the production is used in a derivation, then the weights of the two subderivations should be added. The number of productions in c(G, w) is determined by the number of ways we can instantiate inference rules, which in the case of Figure 1 is O(|G|  ), where |G |is the size of G in terms of the total number of occurrences of terminals and nonterminals in productions. If we assume, without loss of generality, that there is only one goal item, then this goal item becomes the start symbol of c(G, w).</Paragraph>
    <Paragraph position="22">  Since there are no terminals in c(G, w), either the grammar generates the language {epsilon1}, containing only the empty string epsilon1,or it generates the empty language; in the latter case, this indicates that w is not in the language generated by G.</Paragraph>
    <Paragraph position="23"> Note that for all three examples above, the derivation with the lowest weight allowed by c(G, w) encodes the derivation with the lowest weight allowed by G for w. Together with the dynamic programming algorithm to be discussed in the next section that finds the derivation with the lowest weight on the basis of c(G, w),we obtain a modular approach to describing weighted parsers: One part of the description specifies how to construct grammar c(G, w) out of grammar G and input w, and the second part specifies the dynamic programming algorithm to investigate c(G, w). Such a modular way of describing parsers in the unweighted case has already been fully developed in work by Lang (1974) and Billot and Lang (1989). Instead of a deduction system, they use a pushdown transducer to express a parsing strategy such as top-down parsing, left-corner parsing or LR parsing. Such a pushdown transducer can in the context of their work be regarded as specifying a context-free grammar c(G, w), given a context-free grammar G and an input string w. The second part of the description of the parser is a dynamic programming algorithm for actually constructing c(G, w) in polynomial time in the length of w.</Paragraph>
    <Paragraph position="24"> This modular approach to describing parsing algorithms is also applicable to formalisms F other than context-free grammars. For example, it was shown by Vijay-Shanker and Weir (1993) that tree-adjoining parsing can be realized by constructing a context-free grammar c(G, w) out of a tree-adjoining grammar G and an input string w. This can straightforwardly be generalized to weighted (in particular, stochastic) tree-adjoining grammars (Schabes 1992).</Paragraph>
    <Paragraph position="25">  Nederhof Weighted Deductive Parsing It was shown by Boullier (2000) that F may furthermore be the formalism of range concatenation grammars. Since the class of range concatenation grammars generates exactly PTIME, this demonstrates the generality of the approach.</Paragraph>
    <Paragraph position="26">  Instead of string input, one may also consider input consisting of a finite automaton, along the lines of Bar-Hillel, Perles, and Shamir (1964); this can be trivially extended to the weighted case. That we restrict ourselves to string input in this article is motivated by presentational considerations.</Paragraph>
  </Section>
  <Section position="4" start_page="139" end_page="141" type="metho">
    <SectionTitle>
3. Knuth's Algorithm
</SectionTitle>
    <Paragraph position="0"> The algorithm by Dijkstra (1959) effectively finds the shortest path from a distinguished source node in a weighted, directed graph to a distinguished target node.</Paragraph>
    <Paragraph position="1"> The underlying idea of the algorithm is that it suffices to investigate only the shortest paths from the source node to other nodes, since longer paths can never be extended to become shorter paths (weights of edges are assumed to be non-negative).</Paragraph>
    <Paragraph position="2"> Knuth (1977) generalizes this algorithm to the problem of finding lowest-weighted derivations allowed by a context-free grammar with weight functions, similar to those we have seen in the previous section. (The restrictions Knuth imposes on the weight functions will be discussed in the next section.) Again, the underlying idea of the algorithm is that it suffices to investigate only the lowest-weighted derivations of nonterminals.</Paragraph>
    <Paragraph position="3"> The algorithm by Knuth is presented in Figure 4. We have taken the liberty of making some small changes to Knuth's formulation. The largest difference between Knuth's formulation and ours is that we have assumed that the context-free grammar with weight functions on which the algorithm is applied has the form c(G, w), obtained by instantiating the inference rules of a weighted deduction system for given grammar G and input w. Note, however, that c(G, w) is not fully constructed before applying Knuth's algorithm, and the algorithm accesses only as much of it as is needed in its search for the lowest-weighted goal item.</Paragraph>
    <Paragraph position="4"> In the algorithm, the set D contains items I for which the lowest overall weight has been found; this weight is given by u(I). The set E contains items I  . In each iteration, it is established that the lowest weight n(I) for an item I in E is the lowest overall weight for I, which justifies transferring I to D. The algorithm can be extended to output the derivation corresponding to the goal item with the lowest weight; this is fairly trivial and will not be discussed here.</Paragraph>
    <Paragraph position="5"> A few remarks about the implementation of Knuth's algorithm are in order. First, instead of constructing E and n anew at step 2 for each iteration, it may be more efficient to construct them only once and revise them every time a new item I is added to D. This revision consists in removing I from E and combining it with existing items in D, as antecedents of inference rules, in order to find new items to be added to E and/or to update n to assign lower values to items in E. Typically, E would be organized as a priority queue.</Paragraph>
    <Paragraph position="6"> Second, practical implementations would maintain appropriate tables for indexing the items in such a way that when a new item I is added to D, the lists of existing items in D together with which it matches the lists of antecedents of inference rules can be  )) for all such instantiated inference rules.</Paragraph>
    <Paragraph position="7"> 3. If E is empty, then report failure and halt.</Paragraph>
    <Paragraph position="8"> 4. Choose an item I [?] E such that n(I) is minimal.</Paragraph>
    <Paragraph position="9"> 5. Add I to D, and let u(I)=n(I).</Paragraph>
    <Paragraph position="10"> 6. If I is a goal item, then output u(I) and halt.</Paragraph>
    <Paragraph position="11"> 7. Repeat from step 2.</Paragraph>
    <Paragraph position="12"> Figure 4 Knuth's generalization of Dijkstra's algorithm. Implicit are a weighted deduction system, a grammar G and an input w. For conditions on correctness, see Section 4.</Paragraph>
    <Paragraph position="13"> efficiently found. Since techniques for such kinds of indexing are well-established in the computer science literature, no further discussion is warranted here.</Paragraph>
    <Paragraph position="14"> 4. Conditions on the Weight Functions  A sufficient condition for Knuth's algorithm to correctly compute the derivations with the lowest weights is that the weight functions f are all superior, which means that they are monotone nondecreasing in each variable and that f(x  . For this case, Knuth (1977) provides a short and elegant proof of correctness. Note that the weight functions in Figure 1 are all superior, so that correctness is guaranteed.</Paragraph>
    <Paragraph position="15"> In the case of the top-down strategy from Figure 2, however, the weight functions are not all superior, since we have constant weight functions for the predictor, which may yield weights that are less than their arguments. It is not difficult, however, to show that Knuth's algorithm still correctly computes the derivations with the lowest weights, given that we have already established the correctness for the bottom-up case.</Paragraph>
    <Paragraph position="16"> First, note that items of the form [B -*g, j, j], which are introduced by the initializer in the bottom-up case, can be introduced by the starter or the predictor in the top-down case; in the top-down case, these items are generally introduced later than in the bottom-up case. Second, note that such items can contribute to finding a goal item only if from [B -*g, j, j] we succeed in deriving an item [B - g *, j, k] that is either such that B = S, j = 0, and k = n, or such that there is an item [A - a * Bb, i, j]. In either case, the item [B -*g, j, j] can be introduced by the starter or predictor so that [B - g *, j, k] will be available to the algorithm if and when it is needed to determine the derivation with the lowest weight for [S - g *,0,n] or [A - aB * b, i, k], respectively, which will then have a weight greater than or equal to that of [B -*g, j, j].  Nederhof Weighted Deductive Parsing For the alternative top-down strategy from Figure 3, the proof of correctness is similar, but now the proof depends for a large part on the additional forward weights, the first values in the pairs (z, x); note that the second values are the inner weights (i.e., the weights we already considered in Figures 1 and 2). An important observation is that if there are two derivations for the same item with weights (z  ). This shows that no relevant inner weights are overlooked because of the ordering we imposed on pairs (z, x).</Paragraph>
    <Paragraph position="17"> Since Figures 1 through 3 are merely examples to illustrate the possibilities of deduction systems and Knuth's algorithm, we do not provide full proofs of correctness.</Paragraph>
  </Section>
  <Section position="5" start_page="141" end_page="142" type="metho">
    <SectionTitle>
5. Viterbi's Algorithm
</SectionTitle>
    <Paragraph position="0"> This section places Knuth's algorithm in the context of a more commonly used alternative. This algorithm is applicable on a weighted deduction system if a simple partial order on items exists that is such that the antecedents of an inference rule are always strictly smaller than the consequent. When this is the case, we may treat items from small to large to compute their lowest weights. There are no constraints on the weight functions other than that they should be monotone nondecreasing.</Paragraph>
    <Paragraph position="1"> The algorithm by Viterbi (1967) may be the earliest that operates according to this principle. The partial order is based on the linear order given by a string of input symbols. In this article we will let the term &amp;quot;Viterbi's algorithm&amp;quot; refer to the general type of algorithm to search for the derivation with the lowest weight given a deduction system, a grammar, an input string, and a partial order on items consistent with the inference rules in the sense given above.</Paragraph>
    <Paragraph position="2">  Another example of an algorithm that can be seen as an instance of Viterbi's algorithm was presented by Jelinek, Lafferty, and Mercer (1992). This algorithm is essentially CYK parsing (Aho and Ullman 1972) extended to handle weights (in particular, probabilities). The partial order on items is based on the sizes of their spans (i.e., the number of input symbols that the items cover). Weights of items with smaller spans are computed before the weights of those with larger spans. In cases in which a simple a priori order on items is not available but derivations are guaranteed to be acyclic, one may first determine a topological sorting of the complete set of derivable items and then compute the weights based on that order, following Martelli and Montanari (1978).</Paragraph>
    <Paragraph position="3"> A special situation arises when a deduction system is such that inference rules allow cyclic dependencies within certain subsets of items, but dependencies between these subsets represent a partial order. One may then combine the two algorithms: Knuth's (or Dijkstra's) algorithm is used within each subset and Viterbi's algorithm is used to relate items in distinct subsets. This is exemplified by Bouloutas, Hart, and Schwartz (1991).</Paragraph>
    <Paragraph position="4"> In cases in which both Knuth's algorithm and Viterbi's algorithm are applicable, the main difference between the two is that Knuth's algorithm may halt as soon as the lowest weight for a goal item is found, and no items with larger weights than that goal item need to be treated, whereas Viterbi's algorithm treats all derivable items. This suggests that Knuth's algorithm may be more efficient than Viterbi's. The worst-case time complexity of Knuth's algorithm, however, involves an additional factor because 5 Note that some authors let the term &amp;quot;Viterbi algorithm&amp;quot; refer to any algorithm that computes the &amp;quot;Viterbi parse,&amp;quot; that is, the parse with the lowest weight or highest probability.  Computational Linguistics Volume 29, Number 1 of the maintenance of the priority queue. Following Cormen, Leiserson, and Rivest (1990), this factor is O(log(bardblc(G, w)bardbl)), where bardblc(G, w)bardbl is the number of nonterminals in c(G, w), which is an upper bound on the number of elements on the priority queue at any given time. Furthermore, there are observations by, for example, Chitrao and Grishman (1990), Tjong Kim Sang (1998, Sections 3.1 and 3.4), and van Noord et al.</Paragraph>
    <Paragraph position="5"> (1999, Section 3.9), that suggest that the apparent advantage of Knuth's algorithm does not necessarily lead to significantly lower time costs in practice.</Paragraph>
    <Paragraph position="6"> In particular, consider deduction systems with items associated with spans like, for example, that in Figure 1, in which the span of the consequent of an inference rule is the concatenation of the spans of the antecedents. If weights of individual productions in G differ only slightly, as is often the case in practice, then different derivations for an item have only slightly different weights, and the lowest such weight for a certain item is roughly proportional to the size of its span. This suggests that Knuth's algorithm treats most items with smaller spans before any item with a larger span is treated, and since goal items typically have the maximal span, covering the complete input, there are few derivable items at all that are not treated before any goal item is found.</Paragraph>
  </Section>
class="xml-element"></Paper>