<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1026">
  <Title>GRAMMAR SPECIALIZATION THROUGH ENTROPY THRESHOLDS</Title>
  <Section position="4" start_page="189" end_page="190" type="metho">
    <SectionTitle>
SCHEME OVERVIEW
</SectionTitle>
    <Paragraph position="0"> In the following scheme, the desired coverage of the specialized grammar is prescribed, and the parse trees are cut up at appropriate places without having to specify the tree-cutting criteria manually:  1. Index the treebank in an and-or tree where the or-nodes correspond to alternative choices of grammar rules to expand with and the and-nodes correspond to the RHS phrases of each grammar rule. Cutting up the parse trees will involve selecting a set of or-nodes in the and-or tree. Let us call these nodes &amp;quot;cutnodes&amp;quot;.</Paragraph>
    <Paragraph position="1"> 2. Calculate the entropy of each or-node. We will cut at  each node whose entropy exceeds a threshold value. The rationale for this is that we wish to cut up the parse trees where we can expect a lot of variation i.e. where it is difficult to predict which rule will be resolved on next. This corresponds exactly to the nodes in the and-or tree that exhibit high entropy values.</Paragraph>
    <Paragraph position="2"> 3. The nodes of the and-or tree must be partitioned into equivalence classes dependent on the choice of cutnodes in order to avoid redundant derivations at parse time. 4 Thus, selecting some particular node as a cutnode may cause other nodes to also become cutnodes, even though their entropies are not above the threshold.</Paragraph>
    <Paragraph position="3">  4. Determine a threshold entropy that yields the desired coverage. This can be done using for example interval bisection.</Paragraph>
    <Paragraph position="4"> 5. Cut up the training examples by matching them against the and-or tree and cutting at the determined  cutnodes.</Paragraph>
    <Paragraph position="5"> It is interesting to note that a textbook method for conslructing decision trees for classification from attribute-value pairs is to minimize the (weighted average of the) remaining entropy 5 over all possible choices of root attribute, see \[Quinlan 1986\].</Paragraph>
    <Paragraph position="6"> 4This can most easily be seen as follows: Imagine two identical, but different portions of the and-or tree. If the roots and leaves of these portions are all selected as cutnodes, but the distribution of cutnodes within them differ, then we will introduce multiple ways of deriving the portions of the parse trees that match any of these two portions of the and-or tree.</Paragraph>
  </Section>
  <Section position="5" start_page="190" end_page="191" type="metho">
    <SectionTitle>
DETAILED SCHEME
</SectionTitle>
    <Paragraph position="0"> First, the treebank is partitioned into a training set and a test set. The training set will be indexed in an and-or tree and used to extract the specialized rules. The test set will be used to check the coverage of the set of extracted rules.</Paragraph>
    <Paragraph position="1"> Indexing the treebank Then, the set of implicit parse trees is stored in an and-or tree. The parse trees have the general form of a rule identifier Id dominating a list of subtrees or a word of the training sentence. From the current or-node of the and-or tree there will be arcs labelled with rule identifiers corresponding to previously stored parse trees. From this or-node we follow an arc labelled Id, or add a new one if there is none. We then reach (or add) an and-node indicating the RHS phrases of the grammar rule named Id. Here we follow each arc leading out from this and-node in turn to accommodate all the subtrees in the list. Each such arc leads to an or-node.</Paragraph>
    <Paragraph position="2"> We have now reached a point of recursion and can index the corresponding subtree. The recursion terminates if Id is the special rule identifier lex and thus dominates a word of the training sentence, rather than a list of subtrees.</Paragraph>
    <Paragraph position="3"> Indexing the four training examples of Figure 1 will result in the and-or tree of Figure 2.</Paragraph>
    <Paragraph position="4"> Finding the cutnodes Next, we find the set of nodes whose entropies exceed a threshold value. First we need to calculate the entropy of each or-node. We will here describe three different ways of doing this, but there are many others. Before doing this, though, we will discuss the question of redundancy in the resulting set of specialized rules.</Paragraph>
    <Paragraph position="5"> We must equate the cutnodes that correspond to the same type of phrase. This means that if we cut at a node corresponding to e.g. an NP, i.e. where the arcs incident from it are labelled with grammar rules whose left-hand-sides are NPs, we must allow all specialized NP rules to be potentially applicable at this point, not just the ones that are rooted in this node. This requires that we by transitivity equate the nodes that are dominated by a cutnode in a structurally equivalent way; if there is a path from a cutnode cl to a node nl and a path from a cutnode c2 to a node n2 with an identical sequence of labels, the two nodes nl and n2 must be equated. Now if nl is a cutnode, then n2 must also be a cutnode even if it has a low entropy value. The following iterative scheme accomplishes this:  Function N* (N deg) 1. i:=0; 2. Repeat i := i + 1; N i := N(NI-1); 3. Until N i = N i-1 4. Return N~;</Paragraph>
    <Paragraph position="7"> Here N(N j) is the set of cutnodes NJ augmented with those induced in one step by selecting N~ as the set of cutnodes. In ~ practice this was accomplished by compiling an and-or graph from the and-or tree and the set of selected cutnodes, where each set of equated nodes constituted a vertex of the graph, and traversing it.</Paragraph>
    <Paragraph position="8"> In the simplest scheme for calculating the entropy of an or-node, only the RHS phrase of the parent rule, i.e. the dominating and-node, contributes to the entropy, and there is in fact no need to employ an and-or tree at all, since the tree-cutting criterion becomes local to the parse tree being cut up.</Paragraph>
    <Paragraph position="9"> In a slightly more elaborate scheme, we sum over the entropies of the nodes of the parse trees that match this node of the and-or tree. However, instead of letting each daughter node contribute with the full entropy of the LHS phrase of the corresponding grammar rule, these entropies are weighted with the relative frequency of use of each alternative choice of grammar rule.</Paragraph>
    <Paragraph position="10"> For example, the entropy of node n3 of the and-or tree of Figure 2 will be calculated as follows: The mother rule vp_v_np will contribute the entropy associated with the RHS NP, which is, referring to the table above, 0.64. There are 2 choices of rules to resolve on, namely np_det_n and np_np_pp with relative frequencies 1/2 and ~ respectively. Again referring to the entropy table above, we find that the LHS phrases of these rules have entropy 1.33 and 0.00 respectively. This results in the following entropy for node n3:</Paragraph>
    <Paragraph position="12"> The following function determines the set of cutnodes N that either exceed the entropy threshold, or are in-</Paragraph>
  </Section>
  <Section position="6" start_page="191" end_page="192" type="metho">
    <SectionTitle>
2. Return N*(N);
</SectionTitle>
    <Paragraph position="0"> Here S(n) is the entropy of node n.</Paragraph>
    <Paragraph position="1"> In a third version of the scheme, the relative frequencies of the daughters of the or-nodes are used directly to calculate the node entropy:</Paragraph>
    <Paragraph position="3"> Here A is the set of arcs, and {n, ni) is an arc from n to hi. This is basically the entropy used in \[Quinlan 1986\].</Paragraph>
    <Paragraph position="4"> Unfortunately, this tends to promote daughters of cutnodes to in turn become cutnodes, and also results in a problem with instability, especially in conjunction with the additional constraints discussed in a later section, since the entropy of each node is now dependent on the  choice of cutnodes. We must redefine the function N(S) accordingly: Function N(Smin) 1. N O := 0; 2. Repeat i := i+ 1; N := {n: S(nlg '-1) &gt; S,~i,~}; g i := N*(N); 3. Until N*&amp;quot; = N i-1 4. Return N i; Here S(n\]N j) is the entropy of node n given that the set of cutnodes is NJ. Convergence can be ensured 6 by modifying the termination criterion to be 3. Until 3j e \[0, i- 1\] : p(Ni,Y j) &lt; 6(Yi,N j) for some appropriate set metric p(N1, N2) (e.g. the size of the symmetric difference) and norm-like function 6(N1,N2) (e.g. ten percent of the sum of the sizes),  but this is to little avail, since we are not interested in solutions far away from the initial assignment of cutnodes. null Finding the threshold We will use a simple interval-bisection technique for finding the appropriate threshold value. We operate with a range where the lower bound gives at least the desired coverage, but where the higher bound doesn't. We will take the midpoint of the range, find the cutnodes corresponding to this value of the threshold, and check if this gives us the desired coverage. If it does, this becomes the new lower bound, otherwise it becomes the new upper bound. If the lower and upper bounds are close to each other, we stop and return the nodes corresponding to the lower bound. This termination criterion can of course be replaced with something more elaborate. This can be implemented as follows:  Function N(Co) 1. Stow := 0; Shigh := largenumber; Nc := N(0); 2. If Shigh - Sto~o &lt; 6s then goto 6 Sto,,, + Sh i h . else Staid := 2 ' 3. N := N(Smla); 4. If c(g) &lt; Co then Shiflh :: Srnid else Sio~, := Smld; NC/ := N; 5. Goto 2; 6. Return Arc;  Here C(N) is the coverage on the test set of the specialized grammar determined by the set of cutnodes N. Actually, we also need to handle the boundary case where no assignment of cutnodes gives the required coverage. Likewise, the coverages of the upper and lower bound may be far apart even though the entropy difference is small, and vice versa. These problems can readily be taken care of by modifying the termination criterion, but the solutions have been omitted for the sake of clarity.</Paragraph>
    <Paragraph position="5">  In the running example, using the weighted sum of the phrase entropies as the node entropy, if any threshold value less than 1.08 is chosen, this will yield any desired coverage, since the single test example of Figure 1 is then covered.</Paragraph>
    <Paragraph position="6"> Retrieving the specialized rules When retrieving the specialized rules, we will match each training example against the and-or tree. If the current node is a cutnode, we will cut at this point in the training example. The resulting rules will be the set of cut-up training examples. A threshold value of say 1.00 in our example will yield the set of cutnodes {u3, n4, n6, ng} and result in the set of specialized rules of Figure 3.</Paragraph>
    <Paragraph position="7"> If we simply let the and-or tree determine the set of specialized rules, instead of using it to cut up the training examples, we will in general arrive at a larger number of rules, since some combinations of choices in  the and-or tree may not correspond to any training example. If this latter strategy is used in our example, this will give us the two extra rules of Figure 4. Note that they not correspond to any training example.</Paragraph>
  </Section>
  <Section position="7" start_page="192" end_page="193" type="metho">
    <SectionTitle>
ADDITIONAL CONSTRAINTS
</SectionTitle>
    <Paragraph position="0"> As mentioned at the beginning, the specialized grammar is compiled into LR parsing tables. Just finding any set of cutnodes that yields the desired coverage will not necessarily result in a grammar that is well suited for LP~ parsing. In particular, LR parsers, like any other parsers employing a bottom-up parsing strategy, do not blend well with empty productions. This is because without top-down filtering, any empty production is applicable at any point in the input string, and a naive bottom-up parser will loop indefinitely. The LR parsing tables constitute a type of top-down filtering, but this may not be sufficient to guarantee termination, and in any case, a lot of spurious applications of empty productions will most likely take place, degrading performance. For these reasons we will not allow learned rules whose RHSs are empty, but simply refrain from cutting in nodes of the parse trees that do not dominate at least one lexical lookup.</Paragraph>
    <Paragraph position="1"> Even so, the scheme described this far is not totally successful, the performance is not as good as using hand-coded tree-cutting criteria. This is conjectured to be an effect of the reduction lengths being far too short. The first reason for this is that for any spurious rule reduction to take place, the corresponding RHS phrases must be on the stack. The likelihood for this to happen by chance decreases drastically with increased rule length. A second reason for this is that the number of states visited will decrease with increasing reduction length. This can most easily be seen by noting that the number of states visited by a deterministic LR parser equals the number of shift actions plus the number of reductions, and equals the number of nodes in the cot- null responding parse tree, and the longer the reductions, the more shallow the parse tree.</Paragraph>
    <Paragraph position="2"> The hand-coded operationality criteria result in an average rule length of four, and a distribution of reduction lengths that is such that only 17 percent are of length one and 11 percent are of length two. This is in sharp contrast to what the above scheme accomplishes; the corresponding figures are about 20 or 30 percent each for lengths one and two.</Paragraph>
    <Paragraph position="3"> An attempted solution to this problem is to impose restrictions on neighbouring cutnodes. This can be done in several ways; one that has been tested is to select for each rule the RHS phrase with the least entropy, and prescribe that if a node corresponding to the LHS of the rule is chosen as a cutnode, then no node corresponding to this RHS phrase may be chosen as a cutnode, and vice versa. In case of such a conflict, the node (class) with the lowest entropy is removed from the set of cutnodes.</Paragraph>
    <Paragraph position="4"> We modify the function N* to handle this: 2. Repeat i := i+ 1; N i := N(N i-1) \ B(Ni-1); Here B(NJ) is the set of nodes in NJ that should be removed to avoid violating the constraints on neighbouring cutnodes. It is also necessary to modify the termination criterion as was done for the function N(S,,~in) above. Now we can no longer safely assume that the coverage increases with decreased entropy, and we must also modify the interval-bisection scheme to handle this. It has proved reasonable to assume that the coverage is monotone on both sides of some maximum, which simplifies this task considerably.</Paragraph>
  </Section>
</Paper>