<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1115">
  <Title>Edge-Based Best-First Chart Parsing *</Title>
  <Section position="3" start_page="127" end_page="128" type="metho">
    <SectionTitle>
2 Constituent-Based Best-First Chart Parsing
</SectionTitle>
    <Paragraph position="0"/>
      <Paragraph position="0"> In the approach taken in C&amp;C, only completed edges, i.e., constituents, are entered into the agenda; incomplete edges are always processed as soon as they are constructed. At each iteration the constituent with the highest figure of merit is removed from the agenda, added to the chart, and used to extend current partially completed constituents. Thus we characterize their work as constituent-based best-first chart parsing. C&amp;C take as an &amp;quot;ideal&amp;quot; FOM the quantity p(N~,~ \[ to,n). Here N~,k is aconstituent of type i  (e.g., NP, VP, etc.) that spans the constituents from j up to but not including k, and t0,n are the n parts-of-speech (tags) of the sentence. Note that C&amp;C simplify parsing by assuming that the input is a sequence of tags, not words. We make the same assumption in this paper. Thus taking p(Nj, k \[ t0,n) as an FOM says that one should work on the constituent that is most likely to be correct given the tags of the sentence.</Paragraph>
      <Paragraph position="1"> As p(N~, k \[ to,n) can only be computed precisely after a full parse of the sentence, C&amp;C derive several approximations, in each case starting from the well known equation for</Paragraph>
      <Paragraph position="3"> where /3(Nj,k) and a(N~,k) are defined as follows: null</Paragraph>
      <Paragraph position="5"> C&amp;Cs best approximation is based upon the equation:</Paragraph>
      <Paragraph position="7"> Informally, this can be obtained by approximating the outside probability ot(Nj,k) in Equation 1 with a bitag estimate.</Paragraph>
      <Paragraph position="8"> Of the five terms in Equation 4, two can be directly estimated from training data: the &amp;quot;boundary statistics&amp;quot; p(Nj, k I tj) (the probability of a constituent of type Nj,kstarting just after the tag tj) and p(tk I N~, k) (the probability of tk appearing just after the end of a constituent of type Nj k)- The tag sequence probabilitiy in the denominator is approximated using a bi-tag approximation:</Paragraph>
      <Paragraph position="10"> The basic algorithm then is quite simple. One uses the standard chart-parsing algorithm, except at each iteration one takes from the agenda the constituent that maximizes the FOM described in Equation 4.</Paragraph>
      <Paragraph position="11"> There are, however, two minor complexities that need to be noted. The first relates to the inside probability ~(N;,k). C&amp;C approximate it with the sum of the probabilities of all the parses for N~, k found at that point in the parse. This in turn requires a somewhat complicated scheme to avoid repeatedly re-evaluating Equation 4 whenever a new parse is found. In this paper we adopt a slightly simpler method. We approximate fl(N~,k) by the most probable parse for N~ , rather than the sum of all the parses. j~k We justify this on the grounds that our parser eventually returns the most probable parse, so it seems reasonable to base our metric on its value.</Paragraph>
      <Paragraph position="12"> This also simplifies updating i fl(Nj,k) when new parses are found for N~ k- Our algorithm compares the probability of the new parse to the best already found for Nj, k. If the old one is higher, nothing need be done. If the new one is higher, it is simply added to the agenda.</Paragraph>
      <Paragraph position="13"> The second complexity has to do with the fact that in Equation 4 the probability of the tags tj,k are approximated using two different distributions, once in the numerator where we use the PCFG probabilities, and once in the denominator, where we use the bi-tag probabilities. One fact noted by C&amp;C, but not discussed in their paper, is that typically the bi-tag model gives higher probabilities for a tag sequence than does the PCFG distribution. For any single tag tj, the difference is not much, but as we use Equation 4 to compute our FOM for larger constituents, the numerator becomes smaller and smaller with respect to the denominator, effectively favoring smaller constituents. To avoid this one needs to normalize the two distributions to produce more similar results.</Paragraph>
      <Paragraph position="14"> We have empirically measured the normalization factor and found that the bi-tag distribution produces probabilities that are approximately 1.3 times those produced by the PCFG distribution, on a per-word basis. We correct for this by making the PCFG probability of a known tag r/ &gt; 1. This has the effect of multiplying the inside probability ~(Ni,k ) by rl k-j 3 In Section 4 we show how the behavior of our algorithm changes for r/s between 1.0 and 2.4.</Paragraph>
  </Section>
  <Section position="4" start_page="128" end_page="129" type="metho">
    <SectionTitle>
3 Chart parsing and binarization
</SectionTitle>
    <Paragraph position="0"> Informally, our algorithm differs from the one presented in C&amp;C primarily in that we rank all edges, incomplete as well as complete, with respect to the FOM. A straight-forward way to extend C&amp;C in this fashion is to transform the grammar so that all productions are either unary or binary. Once this has been done there is no need for incomplete edges at all in bottom-up parsing, and parsing can be performed using the CKY algorithm, suitably extended to handle unary productions.</Paragraph>
    <Paragraph position="1"> One way to convert a PCFG into this form is left-factoring (Hopcroft and Ullman, 1979). Left-factoring replaces each production A ~ /3 : p, where p is the production probability and Jill = n &gt; 2, with the following set of binary productions: A ~ '~1,n-l'fln :P 'fll,i' ~ '~l,i-l' ~i : 1.0 '/~1,2' ~ /~1 ~2:1.0 for i e \[3, n\] In these productions j3i is the ith element of ~3 and '~3i,j' is the subsequence /3i...flj of fl, but treated as a 'new' single non-terminal in the left-factored grammar (the quote marks indicate that this subsequence is to be considered a single symbol).</Paragraph>
    <Paragraph position="2"> For example, the production  It is not difficult to show that the left-factored grammar defines the same probability distribution over strings as the original grammar, and to devise a tree transformation that maps each parse tree of the original grammar into a unique parse tree of the left-factored grammar of the same probability.</Paragraph>
    <Paragraph position="3"> In fact, the assumption that all productions are at most binary is not extraordinary, since tabular parsers that construct complete parse forests in worst-case O(n 3) time explicitly or implicitly convert their grammars into binary branching form (Lang, 1974; Lang, 1991).</Paragraph>
    <Paragraph position="4"> Sikkel and Nijholt (1997) describe in detail the close relationship between the CKY algorithm, the Earley algorithm and a bottom-up  variant of the Earley algorithm. The key observation is that the 'new' non-terminals 'fll,i' in a CKY parse using a left-factored grammar correspond to the set of non-empty incomplete edges A ~ fll,i &amp;quot;fli+l,n in the bottom-up variant of tim Earley algorithm, where A ~ fll,n is a production of the original grammar. Specifically, the fundamental rule of chart parsing (Kay, 1980), which combines an incomplete edge A ~ a. Bfl with a complete edge B ~ '7- to yield the edge A ~ aB. fl, corresponds to the left-factored productions 'aB' ~ a B if fl is non-empty or A ~ 'a'B if fl is empty. Thus in general a single 'new' non-terminal in a CKY parse using the left-factored grammar abbreviates several incomplete edges in the Earley algorithm.</Paragraph>
  </Section>
  <Section position="5" start_page="129" end_page="130" type="metho">
    <SectionTitle>
4 The Experiment
</SectionTitle>
    <Paragraph position="0"> For our experiment, we used a tree-bank grammar induced from sections 2-21 of the Penn Wall Street Journal text (Marcus et al., 1993), with section 22 reserved for testing. All sentences of length greater than 40 were ignored for testing purposes as done in both C&amp;C and Goodman (1997). We applied the binarization technique described above to the grammar.</Paragraph>
    <Paragraph position="1"> We chose to measure the amount of work done by the parser in terms of the average number of edges popped off the agenda before finding a parse. This method has the advantage of being platform independent, as well as providing a measure of &amp;quot;perfection&amp;quot;. Here, perfection is the minimum number of edges we would need to pop off the agenda in order to create the correct parse. For the binarized grammar, where each popped edge is a completed constituent, this number is simply the number of terminals plus nonterminals in the sentence--- on average, 47.5.</Paragraph>
    <Paragraph position="2"> Our algorithm includes some measures to reduce the number of items on the agenda, and thus (presumably) the number of popped edges.</Paragraph>
    <Paragraph position="3"> Each time we add a constituent to the chart, we combine it with the constituents on either side of it, potentially creating several new edges. For each of these new edges, we check to see if a matching constituent (i.e. a constituent with the same head, start, and end points) already exists in either the agenda or the chart. If there is no match, we simply add the new edge to the agenda. If there is a match but the old parse  of Nj, k is better than the new one, we discard the new parse. Finally, if we have found a better parse of N~,k, we add the new edge to the agenda, removing the old one if it has not already been popped.</Paragraph>
    <Paragraph position="4"> We tested the parser on section section 22 of the WSJ text with various normalization constants r/, working on each sentence only until we reached the first full parse. For each sentence we recorded the number of popped edges needed to reach the first parse, and the precision and recall of that parse. The average number of popped edges to first parse as a function of is shown in Figure 1, and the average precision and recall are shown in Figure 2.</Paragraph>
    <Paragraph position="5"> The number of popped edges decreases as r/ increases from 1.0 to 1.7, then begins to increase again. See Section 5 for discussion of these results. The precision and recall also decrease as r/increases. Note that, because we used a binarized grammer for parsing, the trees produced by the parser contain binarized labels rather than the labels in the treebank. In order to calculate precision and recall, we &amp;quot;debinarized&amp;quot;</Paragraph>
    <Paragraph position="7"> the parser's output and then calculated the figures as usual.</Paragraph>
    <Paragraph position="8"> These results suggest two further questions: Is the higher accuracy with lower r/due in part to the higher number of edges popped? If so, can we gain accuracy with higher r/by letting the parser continue past the first parse (i.e. pop more edges)? To answer these questions, we ran the parser again, this time allowing it to continue parsing until it had popped 20 times as many edges as needed to reach the first parse.</Paragraph>
    <Paragraph position="9"> The results of this experiment are shown in Figure 3, where we plot (precision + recall)/2 (henceforth &amp;quot;accuracy&amp;quot;) as a function of edges. Note that regardless of r/ the accuracy of the parse increases given extra time, but that all of the increase is achieved with only 1.5 to 2 times as many edges as needed for the first parse. For 77 between 1.0 and 1.2, the highest accuracy is almost the same, about 75.2, but this value is reached with an average of slightly under 400 edges when r/ = 1.2, compared to about 650 when r/= 1.0.</Paragraph>
  </Section>
class="xml-element"></Paper>