<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1506"> <Title>Better k-best Parsing</Title> <Section position="5" start_page="53" end_page="55" type="metho"> <SectionTitle> 3 Formulation </SectionTitle> <Paragraph position="0"> Following Klein and Manning (2001), we use weighted directed hypergraphs (Gallo et al., 1993) as an abstraction of the probabilistic parsing problem.</Paragraph> <Paragraph position="1"> Definition 1. An ordered hypergraph (henceforth hypergraph) H is a tuple ⟨V, E, t, R⟩, where V is a finite set of vertices, E is a finite set of hyperarcs, and R is the set of weights. Each hyperarc e ∈ E is a triple e = ⟨T(e), h(e), f(e)⟩, where h(e) ∈ V is its head and T(e) ∈ V* is a vector of tail nodes. f(e) is a weight function from R^|T(e)| to R. t ∈ V is a distinguished vertex called the target vertex.</Paragraph> <Paragraph position="2"> Note that our definition differs from those in previous work in that the tails are now vectors rather than sets, so that the same vertex may occur more than once in a tail and the components of a tail are ordered.</Paragraph> <Paragraph position="3"> Definition 2. A hypergraph H is said to be monotonic if there is a total ordering ⪯ on R such that every weight function f in H is monotonic in each of its arguments according to ⪯, i.e., if f : R^m → R, then ∀1 ≤ i ≤ m, if a_i ⪯ a′_i, then f(a_1, …, a_i, …, a_m) ⪯ f(a_1, …, a′_i, …, a_m). We also define the comparison function min_⪯(a, b) to output a if a ⪯ b, and b otherwise.</Paragraph> <Paragraph position="4"> In this paper we assume monotonicity, which corresponds to the optimal substructure property in dynamic programming (Cormen et al., 2001).</Paragraph> <Paragraph position="5"> Definition 3. We define |e| = |T(e)| to be the arity of the hyperarc. If |e| = 0, then f(e) ∈ R is a constant and we call h(e) a source vertex.
We define the arity of a hypergraph to be the maximum arity of its hyperarcs.</Paragraph> <Paragraph position="6"> Definition 4. The backward-star BS(v) of a vertex v is the set of incoming hyperarcs {e ∈ E | h(e) = v}. The in-degree of v is |BS(v)|.</Paragraph> <Paragraph position="7"> Definition 5. A derivation D of a vertex v in a hypergraph H, its size |D|, and its weight w(D) are recursively defined as follows: * If e ∈ BS(v) with |e| = 0, then D = ⟨e, ε⟩ is a derivation of v, its size |D| = 1, and its weight w(D) = f(e)().</Paragraph> <Paragraph position="8"> * If e ∈ BS(v) with |e| > 0 and D_i is a derivation of T_i(e) for 1 ≤ i ≤ |e|, then D = ⟨e, D_1 ⋯ D_|e|⟩ is a derivation of v, its size |D| = 1 + Σ_{i=1}^{|e|} |D_i|, and its weight w(D) = f(e)(w(D_1), …, w(D_|e|)).</Paragraph> <Paragraph position="9"> The ordering ⪯ on weights in R induces an ordering on derivations: D ⪯ D′ iff w(D) ⪯ w(D′).</Paragraph> <Paragraph position="10"> Definition 6. Define D_i(v) to be the ith-best derivation of v. We can think of D_1(v), …, D_k(v) as the components of a vector we shall denote by D(v). The k-best derivations problem for hypergraphs, then, is to find D(t) given a hypergraph ⟨V, E, t, R⟩.</Paragraph> <Paragraph position="11"> With the derivations thus ranked, we can introduce a nonrecursive representation for derivations that is analogous to the use of back-pointers in parser implementations. Definition 7. A derivation with back-pointers (dbp) D̂ of v is a tuple ⟨e, j⟩ such that e ∈ BS(v) and j ∈ {1, 2, …, k}^|e|. There is a one-to-one correspondence ↔ between dbps of v and derivations of v: ⟨e, (j_1 ⋯ j_|e|)⟩ ↔ ⟨e, D_{j_1}(T_1(e)) ⋯ D_{j_|e|}(T_|e|(e))⟩. Accordingly, we extend the weight function w to dbps: w(⟨e, j⟩) = f(e)(w(D_{j_1}(T_1(e))), …, w(D_{j_|e|}(T_|e|(e)))),</Paragraph> <Paragraph position="13"> and the ordering ⪯ on derivations in turn induces an ordering on dbps: D̂ ⪯ D̂′ iff w(D̂) ⪯ w(D̂′).
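The definitions above can be made concrete with a small data-structure sketch. This is a minimal illustration, not the paper's implementation: the names Hyperarc, Hypergraph, and BS are our own, weights are assumed to be floats, and the weight function is stored directly on each arc.

```python
# A minimal sketch of Definitions 1-5: an ordered hypergraph whose
# hyperarcs carry a monotonic weight function.  All names here are
# illustrative assumptions, not fixed by the paper.
from dataclasses import dataclass
from typing import Callable, Tuple, List

@dataclass(frozen=True)
class Hyperarc:
    head: str                    # h(e)
    tails: Tuple[str, ...]       # T(e): an ordered vector, repeats allowed
    f: Callable[..., float]      # weight function f(e)

    @property
    def arity(self) -> int:      # |e| = |T(e)| (Definition 3)
        return len(self.tails)

@dataclass
class Hypergraph:
    vertices: List[str]          # assumed listed in topological order
    arcs: List[Hyperarc]
    target: str                  # the target vertex t

    def BS(self, v: str) -> List[Hyperarc]:
        """Backward star of v (Definition 4): its incoming hyperarcs."""
        return [e for e in self.arcs if e.head == v]

# a source vertex is the head of an arity-0 arc, whose f() is a constant
H = Hypergraph(vertices=["u", "v"],
               arcs=[Hyperarc("u", (), lambda: 0.5),
                     Hyperarc("v", ("u", "u"), lambda a, b: a + b)],
               target="v")
```

Note that the arc into "v" lists "u" twice, which the vector (rather than set) definition of tails explicitly permits.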
Let D̂_i(v) denote the ith-best dbp of v.</Paragraph> <Paragraph position="14"> Where no confusion will arise, we use the terms 'derivation' and 'dbp' interchangeably.</Paragraph> <Paragraph position="15"> Computationally, then, the k-best problem can be stated as follows: given a hypergraph H with arity a, compute D̂_1(t), …, D̂_k(t).¹ As shown by Klein and Manning (2001), hypergraphs can represent the search space of most parsers (just as graphs, also known as trellises or lattices, can represent the search space of finite-state automata or HMMs). More generally, hypergraphs can represent the search space of most weighted deductive systems (Nederhof, 2003). For example, the weighted CKY algorithm, given a context-free grammar G = ⟨N, T, P, S⟩ in Chomsky Normal Form (CNF) and an input string w, can be represented as a hypergraph of arity 2 as follows. Each item [X, i, j] is represented as a vertex v, corresponding to the recognition of nonterminal X spanning w from positions i+1 through j. For each production rule X → YZ in P and three free indices i < k < j, we have a hyperarc ⟨((Y, i, k), (Z, k, j)), (X, i, j), f⟩ corresponding to an instantiation of the inference rule in the deductive system of Shieber et al. (1995), where the weight function f is defined as f(a, b) = a · b · Pr(X → YZ), as in (Nederhof, 2003). In this sense, hypergraphs can be thought of as compiled or instantiated versions of weighted deductive systems.</Paragraph> <Paragraph position="16"> A parser does nothing more than traverse this hypergraph. In order that derivation values be computed correctly, however, we need to traverse the hypergraph in a particular order: Definition 8. The graph projection of a hypergraph H = ⟨V, E, t, R⟩ is a directed graph G = ⟨V, E′⟩ where E′ = {(u, v) | ∃e ∈ BS(v), u ∈ T(e)}.
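The CKY-to-hypergraph compilation described above can be sketched directly. The toy CNF grammar and its probabilities below are invented for illustration; each hyperarc is encoded as a plain (tails, head, rule-probability) triple, with the weight function f(a, b) = a · b · Pr(X → YZ) left implicit.

```python
# Sketch of compiling weighted CKY into a hypergraph: vertices are
# items (X, i, j), with one hyperarc per rule instantiation.  The
# grammar here is an invented toy example.
from itertools import product

binary = {("S", ("A", "B")): 0.9}                 # X -> Y Z with Pr
lexical = {("A", "a"): 0.5, ("B", "b"): 0.4}      # X -> terminal with Pr
w = "ab"
n = len(w)

arcs = []
for i in range(n):                                # arity-0 (source) hyperarcs
    for (X, a), p in lexical.items():
        if w[i] == a:
            arcs.append(((), (X, i, i + 1), p))
for (X, (Y, Z)), p in binary.items():             # binary hyperarcs
    for i, k, j in product(range(n + 1), repeat=3):
        if i < k < j:                             # three free indices i < k < j
            arcs.append((((Y, i, k), (Z, k, j)), (X, i, j), p))
```

For the two-word input, this yields two source arcs (one per lexical match) and a single binary arc whose head is the goal item (S, 0, 2).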
A hypergraph H is said to be acyclic if its graph projection G is a directed acyclic graph; a topological ordering of H is then an ordering of V that is a topological ordering of G (from sources to target).</Paragraph> <Paragraph position="17"> We assume the input hypergraph is acyclic so that we can traverse it in topological order. In practice the hypergraph is typically not known in advance, but a topological ordering often is, so that the (dynamic) hypergraph can be generated in that order. [Figure 1 (caption fragment): … the source vertices, (b) a hyperpath π_t in H, and (c) a derivation of t in H, where vertex u appears twice with two different (sub-)derivations. This would be impossible in a hyperpath.] For example, for CKY it is sufficient to generate all items [X, i, j] before all items [Y, i′, j′] when j′ − i′ > j − i (X and Y are arbitrary nonterminals).</Paragraph> <Paragraph position="18"> Excursus: Derivations and Hyperpaths. The work of Klein and Manning (2001) introduces a correspondence between hyperpaths and derivations. When extended to the k-best case, however, that correspondence no longer holds.</Paragraph> <Paragraph position="19"> Definition 9. (Nielsen et al., 2005) Given a hypergraph H = ⟨V, E, t, R⟩, a hyperpath π_v of destination v ∈ V is an acyclic minimal hypergraph H̃ = ⟨Ṽ, Ẽ, v, R⟩ such that 1. Ẽ ⊆ E 2. v ∈ Ṽ = ∪_{e∈Ẽ} (T(e) ∪ {h(e)}) 3. ∀u ∈ Ṽ, u is either a source vertex or connected to a source vertex in H̃.</Paragraph> <Paragraph position="20"> As illustrated by Figure 1, derivations (as trees) differ from hyperpaths (as minimal hypergraphs) in that in a derivation the same vertex can appear more than once, with possibly different sub-derivations, while it is represented at most once in a hyperpath.
Thus, the k-best derivations problem we solve in this paper is very different in nature from the k-shortest hyperpaths problem of (Nielsen et al., 2005).</Paragraph> <Paragraph position="21"> However, the two problems do coincide when k = 1 (since all the sub-derivations must then be optimal), and for this reason the 1-best hyperpath algorithm of (Klein and Manning, 2001) is very similar to the 1-best tree algorithm of (Knuth, 1977). For the k-best case (k > 1), they also coincide when the hypergraph is isomorphic to a Case-Factor Diagram (CFD) (McAllester et al., 2004) (proof omitted). The derivation forest of CFG parsing under the CKY algorithm, for instance, can be represented as a CFD, while the forest of the Earley algorithm cannot.</Paragraph> <Paragraph position="23"> [Figure 3: the 1-best Viterbi algorithm. 1: procedure Viterbi 2: for v ∈ V in topological order do 3: for e ∈ BS(v) do ▷ for all incoming hyperarcs 4: D_1(v) ← min_⪯(D_1(v), ⟨e, 1⟩) ▷ update] An item (or equivalently, a vertex in the hypergraph) can appear twice in an Earley derivation because of the prediction rule (see Figure 2 for an example).</Paragraph> <Paragraph position="24"> The k-best derivations problem potentially has further applications in tree generation (Knight and Graehl, 2005), which cannot be modeled by hyperpaths; a detailed discussion along these lines is beyond the scope of this paper.</Paragraph> </Section> <Section position="6" start_page="55" end_page="56" type="metho"> <SectionTitle> 4 Algorithms </SectionTitle> <Paragraph position="0"> The traditional 1-best Viterbi algorithm traverses the hypergraph in topological order and, for each vertex v, calculates its 1-best derivation D_1(v) using all incoming hyperarcs e ∈ BS(v) (see Figure 3).
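The 1-best traversal just described can be sketched as a short function. This is an illustrative rendering, not the paper's pseudocode: arcs are assumed to be (tails, head, f) triples, weights are probabilities, and "better" means larger.

```python
# A runnable sketch of the 1-best Viterbi traversal (cf. Figure 3):
# visit vertices in topological order and relax every incoming
# hyperarc.  Arc format and vertex names are assumptions.
def viterbi(vertices, arcs):
    """vertices must be in topological order (sources first)."""
    best = {}
    for v in vertices:
        for tails, head, f in arcs:
            if head != v or any(u not in best for u in tails):
                continue
            cand = f(*(best[u] for u in tails))        # weight of <e, 1>
            if head not in best or cand > best[head]:  # the min_prec update
                best[head] = cand
    return best

arcs = [((), "x", lambda: 0.5),                        # arity-0 source arcs
        ((), "y", lambda: 0.4),
        (("x", "y"), "t", lambda a, b: a * b * 0.9)]   # f(a,b) = a*b*Pr(rule)
best = viterbi(["x", "y", "t"], arcs)
```

Since every tail precedes its head in the topological order, each vertex's value is final by the time any arc uses it, which is exactly why the ordering assumption matters.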
If we take the arity of the hypergraph to be constant, the overall time complexity of this algorithm is O(|E|).</Paragraph> <Section position="1" start_page="55" end_page="56" type="sub_section"> <SectionTitle> 4.1 Algorithm 0: naïve </SectionTitle> <Paragraph position="0"> Following (Goodman, 1999; Mohri, 2002), we isolate two basic operations in line 4 of the 1-best algorithm that can be generalized in order to extend the algorithm: first, the formation of the derivation ⟨e, 1⟩ out of the |e| best sub-derivations (a generalization of the binary operator ⊗ in a semiring); second, min_⪯, which chooses the better of two derivations (the ⊕ operator in an idempotent semiring (Mohri, 2002)). We now generalize these two operations to operate on k-best lists.</Paragraph> <Paragraph position="1"> Let r = |e|. The new multiplication operation, mult_k(e), is performed in three steps: 1. enumerate the k^r derivations {⟨e, j_1 ⋯ j_r⟩ | ∀i, 1 ≤ j_i ≤ k}. Time: O(k^r).</Paragraph> <Paragraph position="2"> 2. sort these k^r derivations (according to weight). Time: O(k^r log(k^r)) = O(r k^r log k).</Paragraph> <Paragraph position="3"> 3. select the first k elements of the sorted list of k^r elements. Time: O(k).</Paragraph> <Paragraph position="4"> So the overall time complexity of mult_k is O(r k^r log k). We also have to extend min_⪯ to merge_k, which takes two vectors of length k (or fewer) as input and outputs the top k (in sorted order) of the 2k elements. This is similar to the merge step of merge-sort (Cormen et al., 2001) and can be done in linear time O(k).
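The two generalized operations can be sketched as follows, assuming numeric weights where smaller means better (the paper allows any monotonic total order); the names mult_k and merge_k mirror the text, but the signatures are our own.

```python
# Sketches of Algorithm 0's two k-best operations.  mult_k enumerates
# all k^r index combinations, sorts, and keeps the top k; merge_k
# merges two sorted k-best lists and truncates to k.
from itertools import product
import heapq

def mult_k(f, tail_lists, k):
    """tail_lists: the sorted k-best weight list of each tail of e."""
    cands = []
    for js in product(*(range(len(l)) for l in tail_lists)):
        ws = (tail_lists[i][j] for i, j in enumerate(js))
        cands.append((f(*ws), js))       # the derivation <e, j_1...j_r>
    return sorted(cands)[:k]             # O(r k^r log k) dominates

def merge_k(a, b, k):
    """Top k (in sorted order) of the union of two sorted lists."""
    return list(heapq.merge(a, b))[:k]

top = mult_k(lambda a, b: a + b, [[1, 3], [2, 5]], 3)
```

With tails [1, 3] and [2, 5] under f(a, b) = a + b, the four combinations weigh 3, 6, 5, 8, so the top three are 3, 5, 6, showing why full enumeration is wasteful for large k and r.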
Then, we only need to rewrite line 4 of the Viterbi algorithm (Figure 3) to extend it to the k-best case:</Paragraph> </Section> </Section> <Section position="7" start_page="56" end_page="56" type="metho"> <SectionTitle> 4: D(v) ← merge_k(D(v), mult_k(e)) </SectionTitle> <Paragraph position="0"> The time complexity of this line is O(|e| k^|e| log k), making the overall complexity O(|E| k^a log k) if we consider the arity a of the hypergraph to be constant.² The overall space complexity is O(|V| k), since for each vertex we need to store a vector of length k.</Paragraph> <Paragraph position="1"> In the context of CKY parsing for CFG, the 1-best Viterbi algorithm has complexity O(n³ |P|), while the k-best version is O(n³ |P| k² log k), slower by a factor of O(k² log k).</Paragraph> <Section position="1" start_page="56" end_page="56" type="sub_section"> <SectionTitle> 4.2 Algorithm 1: speed up mult_k </SectionTitle> <Paragraph position="0"> First we seek to exploit the facts that the input vectors are all sorted, that the function f is monotonic, and that we are only interested in the top k of the k^|e| possibilities. Define 1 to be the vector whose elements are all 1; define b^i to be the vector whose elements are all 0 except b^i_i = 1.</Paragraph> <Paragraph position="1"> As we compute p_e = mult_k(e), we maintain a candidate set C of derivations that have the potential to be the next best derivation in the list. [Footnote 2: Actually, we do not need to sort all k^|e| elements in order to extract the top k among them; there is an efficient algorithm (Cormen et al., 2001) that can select the kth best of the k^|e| elements in time O(k^|e|), so we can improve this overhead to O(k^a).]</Paragraph> <Paragraph position="2"> If we picture the input as an |e|-dimensional space, C contains those derivations that have not yet been included in p_e but are on the boundary with those that have. It is initialized to {⟨e, 1⟩}.
At each step, we extract the best derivation from C, call it ⟨e, j⟩, and append it to p_e. Then ⟨e, j⟩ must be replaced in C by its neighbors, {⟨e, j + b^l⟩ | 1 ≤ l ≤ |e|} (see Figure 4 for an illustration). We implement C as a priority queue (Cormen et al., 2001) to make the extraction of its best derivation efficient. Each iteration performs one Extract-Min and |e| Insert operations. With a binary-heap implementation of the priority queue, each iteration takes O(|e| log(k|e|)) time.³ Since we are only interested in the top k elements, there are k iterations, and the time complexity of a single mult_k is O(k|e| log(k|e|)), yielding an overall time complexity of O(|E| k log k) and reducing the multiplicative overhead by a factor of O(k^{a−1}) (again, assuming a is constant). In the context of CKY parsing, this reduces the overhead to O(k log k). Figure 5 shows the additional pseudocode needed for this algorithm. It is integrated into the Viterbi algorithm (Figure 3) simply by rewriting line 4 to invoke the function Mult(e, k):</Paragraph> </Section> </Section> <Section position="8" start_page="56" end_page="58" type="metho"> <SectionTitle> 4: D(v) ← merge_k(D(v), Mult(e, k)) </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="56" end_page="58" type="sub_section"> <SectionTitle> 4.3 Algorithm 2: combine merge_k into mult_k </SectionTitle> <Paragraph position="0"> We can further speed up both merge_k and mult_k by a similar idea. Instead of letting each mult_k generate a full k derivations for each hyperarc e and only then applying merge_k to the results, we can combine the candidate sets of all the hyperarcs into a single candidate set. That is, we initialize C to {⟨e, 1⟩ | e ∈ BS(v)}, the set of the top parses from each incoming hyperarc (cf. Algorithm 1).
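Algorithm 1's frontier-based multiplication can be sketched concretely. This is an illustrative rendering under the same assumptions as before (sorted numeric tail lists, smaller is better, additive f); the function name mult_heap and the tuple encoding of index vectors are our own.

```python
# A sketch of Algorithm 1: keep a priority queue of frontier
# candidates, pop the best, and push its |e| successor index
# vectors j + b^i.  Indices are 0-based here, unlike the paper's 1.
import heapq

def mult_heap(f, tails, k):
    """tails: sorted k-best weight list of each tail of hyperarc e."""
    def weight(js):
        return f(*(tails[i][j] for i, j in enumerate(js)))
    start = (0,) * len(tails)
    cand = [(weight(start), start)]        # C initialized to {<e, 1>}
    seen = {start}
    out = []
    while cand and len(out) < k:
        w, js = heapq.heappop(cand)        # one Extract-Min per iteration
        out.append((w, js))
        for i in range(len(tails)):        # up to |e| Inserts: j + b^i
            nj = js[:i] + (js[i] + 1,) + js[i + 1:]
            if nj[i] < len(tails[i]) and nj not in seen:
                seen.add(nj)
                heapq.heappush(cand, (weight(nj), nj))
    return out

out = mult_heap(lambda a, b: a + b, [[1, 3], [2, 5]], 3)
```

Monotonicity of f is what makes this correct: every candidate not yet on the frontier is dominated by some frontier element, so the next best derivation is always in the heap.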
Indeed, it suffices to keep only the top k of the |BS(v)| candidates in C, which leads to a significant speedup when |BS(v)| ≫ k.⁴ Now the top derivation in C is the top derivation for v. Then, whenever we remove an element ⟨e, j⟩ from C, we replace it with its |e| neighbors {⟨e, j + b^l⟩ | 1 ≤ l ≤ |e|} (again, as in Algorithm 1). The full pseudocode for this algorithm is shown in Figure 6.</Paragraph> </Section> <Section position="2" start_page="56" end_page="58" type="sub_section"> <SectionTitle> 4.4 Algorithm 3: compute mult_k lazily </SectionTitle> <Paragraph position="0"> Algorithm 2 exploited the idea of lazy computation: performing mult_k only as many times as necessary. But this algorithm still calculates a full k-best list for every vertex in the hypergraph, whereas we are only interested in the k-best derivations of the target vertex (goal item). [Footnote 3: If we maintain a second heap alongside the Min-Heap to bound its size, we can reduce the per-iteration cost to O(|e| log k), and with a Fibonacci heap we can further improve it to O(|e| + log k). But these techniques do not change the overall complexity when a is constant, as we will see.]</Paragraph> <Paragraph position="1"> [Footnote 4: This can be implemented by a linear-time randomized selection algorithm (a.k.a. quick-select) (Cormen et al., 2001).] [Figure 4: The weight function f is defined as f(a, b) = a + b. Italic numbers on the x and y axes are the a_i's and b_j's, respectively. We want to compute the top 3 results of f(a_i, b_j) with 1 ≤ i, j ≤ 3. In each iteration the current frontier is shown in oval boxes, with boldface denoting the best element among them; that element is extracted and replaced by its two neighbors in the next iteration.]</Paragraph> <Paragraph position="2"> [Figure 5: pseudocode for Algorithm 1. 1: function Mult(e, k) 2: cand ← {⟨e, 1⟩} ▷ initialize the heap 3: p ← empty list ▷ the result of mult_k 4: while |p| < k and |cand| > 0 do 5: AppendNext(cand, p) 6: return p 7: 8: procedure AppendNext(cand, p) 9: ⟨e, j⟩ ← Extract-Min(cand) 10: append ⟨e, j⟩ to p 11: for i ← 1 … |e| do ▷ add the |e| neighbors 12: j′ ← j + b^i 13: if j′_i ≤ |D(T_i(e))| and ⟨e, j′⟩ ∉ cand then 14: Insert(cand, ⟨e, j′⟩) ▷ add to heap] [Figure 7: pseudocode for Algorithm 3. 1: procedure LazyKthBest(v, k, k′) ▷ k′ is the global k 2: if |D(v)| ≥ k then ▷ kth derivation already computed? 3: return 4: if cand[v] is not defined then ▷ first visit of vertex v? 5: GetCandidates(v, k′) ▷ initialize the heap 6: append Extract-Min(cand[v]) to D(v) ▷ 1-best 7: while |D(v)| < k and |cand[v]| > 0 do 8: ⟨e, j⟩ ← D̂_|D(v)|(v) ▷ last derivation 9: LazyNext(cand[v], e, j, k′) ▷ update the heap, adding the successors of the last derivation 10: append Extract-Min(cand[v]) to D(v) ▷ get the next best derivation and delete it from the heap 11: 12: procedure LazyNext(cand, e, j, k′) 13: for i ← 1 … |e| do ▷ add the |e| neighbors 14: j′ ← j + b^i 15: LazyKthBest(T_i(e), j′_i, k′) ▷ recursively solve a sub-problem 16: if j′_i ≤ |D(T_i(e))| and ⟨e, j′⟩ ∉ cand then ▷ if it exists and is not in the heap yet 17: Insert(cand, ⟨e, j′⟩) ▷ add to heap] We can therefore take laziness to an extreme by delaying the whole k-best calculation until after parsing. Algorithm 3 assumes an initial parsing phase that generates the hypergraph and finds the 1-best derivation of each item; in the second phase, it proceeds as in Algorithm 2, but starts at the goal item and calls itself recursively only as necessary.
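A condensed executable sketch of the lazy scheme follows. It is a simplification, not the paper's exact pseudocode: the class name LazyKBest, the 0-based index vectors, and the arc encoding bs[v] = list of (tails, f) are all our own conventions, and smaller weights are assumed better.

```python
# A sketch of Algorithm 3 (LazyKthBest): k-best lists are grown on
# demand, starting from a query vertex and recursing into tails only
# as far as their sub-derivations are actually needed.
import heapq

class LazyKBest:
    def __init__(self, bs):          # bs[v] = incoming arcs (tails, f)
        self.bs = bs
        self.D = {}                  # v -> sorted list of (weight, (arc, js))
        self.cand = {}               # v -> candidate heap
        self.inheap = {}             # v -> set of (arc, js) seen

    def kth(self, v, k):
        """Grow D[v] to (up to) k best derivations and return it."""
        if v not in self.cand:       # first visit: seed the heap
            self.D[v], self.cand[v], self.inheap[v] = [], [], set()
            for a, (tails, f) in enumerate(self.bs[v]):
                self._push(v, a, (0,) * len(tails))
        while len(self.D[v]) < k and self.cand[v]:
            w, a, js = heapq.heappop(self.cand[v])
            self.D[v].append((w, (a, js)))
            for i in range(len(js)):  # successors of the last derivation
                self._push(v, a, js[:i] + (js[i] + 1,) + js[i + 1:])
        return self.D[v]

    def _push(self, v, a, js):
        if (a, js) in self.inheap[v]:
            return
        tails, f = self.bs[v][a]
        ws = []
        for u, j in zip(tails, js):
            lst = self.kth(u, j + 1)  # recursively solve a sub-problem
            if len(lst) <= j:
                return                # that sub-derivation does not exist
            ws.append(lst[j][0])
        self.inheap[v].add((a, js))
        heapq.heappush(self.cand[v], (f(*ws), a, js))

bs = {"x": [((), lambda: 1.0), ((), lambda: 2.0)],
      "t": [(("x", "x"), lambda a, b: a + b)]}
kb = LazyKBest(bs)
```

Asking for the top three derivations of "t" pulls only as much of each tail's list as the successor vectors demand, which is the point of the laziness.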
The pseudocode for this algorithm is shown in Figure 7. As a side note, this second phase should be applicable also to a cyclic hypergraph, as long as its derivation weights are bounded.</Paragraph> <Paragraph position="3"> Algorithm 2 has an overall complexity of O(|E| + |V| k log k), and Algorithm 3 is O(|E| + |D_max| k log k), where |D_max| is the size of the longest among the top k derivations (for a CFG in CNF, |D| = 2n − 1 for all D, so |D_max| is O(n)). These are significant improvements over Algorithms 0 and 1, since they turn the multiplicative overhead into an additive one. In practice |E| usually dominates, as in CKY parsing of CFG, so in theory the running time grows very slowly as k increases, exactly as demonstrated by our experiments below.</Paragraph> </Section> <Section position="3" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 4.5 Summary and Discussion of Algorithms </SectionTitle> <Paragraph position="0"> The four algorithms, along with the 1-best Viterbi algorithm and the generalized Jiménez and Marzal algorithm, are compared in Table 1.</Paragraph> <Paragraph position="1"> The key difference between our Algorithm 3 and Jiménez and Marzal's algorithm is the restriction to the top k candidates before making heaps (line 11 in Figure 6; see also Sec. 4.3). Without this line, Algorithm 3 could be considered a generalization of the Jiménez and Marzal algorithm to the case of acyclic monotonic hypergraphs. This line is also responsible for improving the time complexity from O(|E| + |D_max| k log(d + k)) (generalized Jiménez and Marzal algorithm) to O(|E| + |D_max| k log k), where d = max_v |BS(v)| is the maximum in-degree among all vertices. So when k < d, our algorithm outperforms Jiménez and Marzal's.</Paragraph> </Section> </Section> </Paper>