<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1026">
  <Title>A Tabulation-Based Parsing Method that Reduces Copying</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper addresses the cost of copying edges in memoization-based, all-paths parsers for phrase-structure grammars. While there have been great advances in probabilistic parsing methods in the last five years, which find one or a few most probable parses for a string relative to some grammar, all-paths parsing is still widely used in grammar development, and as a means of verifying the accuracy of syntactically more precise grammars, given a corpus or test suite.</Paragraph>
    <Paragraph position="1"> Most if not all efficient all-paths phrase-structure-based parsers for natural language are chart-based because of the inherent ambiguity that exists in large-scale natural language grammars. Within WAM-based Prolog, memoization can be a fairly costly operation because, in addition to the cost of copying an edge into the memoization table, there is the further cost of copying an edge out of the table onto the heap so that it can be used as a premise in further deductions (phrase structure rule applications). All textbook bottom-up Prolog parsers copy edges out: once for every attempt to match an edge to a daughter category, based on a matching end-point node, which is usually the first argument on which the memoization predicate is indexed. Depending on the grammar and the empirical distribution of matching mother/lexical and daughter descriptions, this number could approach [formula lost in extraction] copies for an edge added early to the chart, where n is the length of the input to be parsed.</Paragraph>
    <Paragraph position="2"> For classical context-free grammars, the category information that must be copied is normally quite small in size. For feature-structure-based grammars and other highly lexicalized grammars with large categories, however, which have become considerably more popular since the advent of the standard parsing algorithms, it becomes quite significant. The ALE system (Carpenter and Penn, 1996) attempts to reduce this cost by using an algorithm due to Carpenter that traverses the string breadth-first, right-to-left, but matches rule daughters depth-first, left-to-right in a failure-driven loop, which eliminates the need for active edges and keeps the sizes of the heap and call stack small. It still copies a candidate edge every time it tries to match it to a daughter description, however, and the number of copies can approach [formula lost in extraction] because of its lack of active edges. The OVIS system (van Noord, 1997) employs selective memoization, which tabulates only maximal projections in a head-corner parser; partial projections of a head are still recomputed. A chart parser with zero copying overhead has yet to be discovered, of course. This paper presents one that reduces this worst case to two copies per non-empty edge, regardless of the length of the input string or when the edge was added to the chart.</Paragraph>
    <Paragraph position="3"> Since textbook chart parsers require at least two copies per edge as well (assertion and potentially matching the next lexical edge to the left/right), this algorithm always achieves the best-case number of copies attainable by them on non-empty edges. It is thus of some theoretical interest in that it proves that at least a constant bound is attainable within a Prolog setting. It does so by invoking a new kind of grammar transformation, called EFD-closure, which ensures that a grammar need not match an empty category to the leftmost daughter of any rule. This transformation is similar to many of the myriad earlier transformations proposed for exploring the decidability of recognition under various parsing control strategies, but the property it establishes is more conservative than brute-force epsilon elimination for unification-based grammars (Dymetman, 1994). It also still treats empty categories distinctly from non-empty ones, unlike the linking tables proposed for treating leftmost daughters in left-corner parsing (Pereira and Shieber, 1987). Its motivation, the practical consideration of copying overhead, is also rather different, of course.</Paragraph>
    <Paragraph position="4"> The algorithm will be presented as an improved version of ALE's parser, although other standard bottom-up parsers can be similarly adapted.</Paragraph>
  </Section>
</Paper>