<?xml version="1.0" standalone="yes"?> <Paper uid="P89-1018"> <Title>The Structure of Shared Forests in Ambiguous Parsing</Title> <Section position="3" start_page="143" end_page="144" type="intro"> <SectionTitle> 2 A Uniform Framework </SectionTitle> <Paragraph position="0"> To discuss the above issues in a uniform way, we need a general framework that encompasses all forms of chart parsing and shared forest building in a unique formalism. We shall take as a basis a formalism developed by the second author in previous papers [15,16]. The idea of this approach is to separate the dynamic programming constructs needed for efficient chart parsing from the chosen parsing schema. Comparison between the classifications of Kay [14] and Griffith & Petrick [10] shows that a parsing schema (or parsing strategy) may be expressed in the construction of a Push-Down Transducer (PDT), a well studied formalization of left-to-right CF parsers 5. These PDTs are usually non-deterministic and cannot be used as produced for actual parsing. Their backtrack simulation does not always terminate, and is often time-exponential when it does, while breadth-first simulation is usually exponential for both time and space. However, by extending Earley's dynamic programming construction to PDTs, Lang provided in [15] a way of simulating all possible computations of any PDT in cubic time and space complexity. 5 Griffith & Petrick actually use Turing machines for pedagogical reasons.</Paragraph> <Paragraph position="1">
This approach may thus be used as a uniform framework for comparing chart parsers 6.</Paragraph> <Section position="1" start_page="143" end_page="144" type="sub_section"> <SectionTitle> 2.1 The algorithm </SectionTitle> <Paragraph position="0"> The following is a formal overview of parsing by dynamic programming interpretation of PDTs.</Paragraph> <Paragraph position="1"> Our aim is to parse sentences in the language L(G) generated by a CF phrase structure grammar G = (V, Σ, Π, N) according to its syntax. The notation used is V for the set of nonterminals, Σ for the set of terminals, Π for the rules, N for the initial nonterminal, and ε for the empty string.</Paragraph> <Paragraph position="2"> We assume that, by some appropriate parser construction technique (e.g. [12,6,1]) we mechanically produce from the grammar G a parser for the language L(G) in the form of a (possibly non-deterministic) push-down transducer (PDT) TG. The output of each possible computation of the parser is a sequence of rules in Π 7 to be used in a left-to-right reduction of the input sentence (this is obviously equivalent to producing a parse-tree).</Paragraph> <Paragraph position="3"> We assume for the PDT TG a very general formal definition that can fit most usual PDT construction techniques. It is defined as an 8-tuple TG = (Q, Σ, Δ, Π, δ, q, $, F) where: Q is the set of states, Σ is the set of input word symbols, Δ is the set of stack symbols, Π is the set of output symbols 8 (i.e. rules of G), q is the initial state, $ is the initial stack symbol, F is the set of final states, δ is a finite set of transitions of the form (p A a ↦ q B u) with p, q ∈ Q, A, B ∈ Δ ∪ {ε}, a ∈ Σ ∪ {ε}, and u ∈ Π*.</Paragraph> <Paragraph position="4"> Let the PDT be in a configuration ρ = (p Aα az u) where p is the current state, Aα is the stack contents with A on the top, az is the remaining input where the symbol a is the next to be shifted and z ∈ Σ*, and u is the already produced output.
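As an illustration only (the paper gives no code, and all names below are ours), the configuration and transition step just defined can be sketched in Python: a transition (p A a ↦ q B u) applies when the state, top-of-stack, and next input symbol match, with None standing in for ε.

```python
from typing import NamedTuple, Optional

class Transition(NamedTuple):
    p: str              # source state
    A: Optional[str]    # stack symbol to pop (None stands for epsilon)
    a: Optional[str]    # input symbol to scan (None stands for epsilon)
    q: str              # target state
    B: Optional[str]    # stack symbol to push (None stands for epsilon)
    u: tuple            # output to append (a sequence of rule names)

class Config(NamedTuple):
    state: str
    stack: tuple        # top of the stack is stack[0]
    rest: tuple         # remaining input
    out: tuple          # output produced so far

def apply(t: Transition, c: Config) -> Optional[Config]:
    """Return the successor configuration, or None if t does not apply."""
    if t.p != c.state:
        return None
    stack = c.stack
    if t.A is not None:                       # pop A (unless epsilon)
        if not stack or stack[0] != t.A:
            return None
        stack = stack[1:]
    rest = c.rest
    if t.a is not None:                       # scan a (unless epsilon)
        if not rest or rest[0] != t.a:
            return None
        rest = rest[1:]
    if t.B is not None:                       # push B (unless epsilon)
        stack = (t.B,) + stack
    return Config(t.q, stack, rest, c.out + t.u)
```

For instance, applying Transition('p', 'A', 'a', 'q', 'B', ('r1',)) to Config('p', ('A', '$'), ('a', 'b'), ()) scans a, replaces A by B, and appends r1 to the output.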
The application of a transition τ = (p A a ↦ q B v) results in a new configuration ρ' = (q Bα z uv) where the terminal symbol a has been scanned (i.e. shifted), A has been popped and B has been pushed, and v has been concatenated to the existing output u.</Paragraph> <Paragraph position="5"> If the terminal symbol a is replaced by ε in the transition, no input symbol is scanned.</Paragraph> <Paragraph position="6"> If A (resp. B) is replaced by ε then no stack symbol is popped from (resp. pushed on) the stack.</Paragraph> <Paragraph position="7"> Our algorithm consists in an Earley-like 9 simulation of the PDT TG. Using the terminology of [1], the algorithm builds an item set Si successively for each word symbol zi holding position i in the input sentence z. An item is constituted of two modes of the form (p A i) where p is a PDT state, A is a stack symbol, and i is the index of an input symbol.</Paragraph> <Paragraph position="8"> The item set Si contains items of the form ((p A i) (q B j)).</Paragraph> <Paragraph position="9"> These items are used as nonterminals of an output grammar 𝒢. 6 The original intent of [15] was to show how one can generate efficient general CF chart parsers, by first producing the PDT with the efficient techniques for deterministic parsing developed for compiler technology [6,12,1]. This idea was later successfully used by Tomita [31] who applied it to LR(1) parsers [6,1], and later to other pushdown based parsers [32].</Paragraph> <Paragraph position="10"> 7 Implementations usually denote these rules by their index in the set Π.</Paragraph> <Paragraph position="11"> 8 Actual implementations use output symbols from Π ∪ Σ, since rules alone do not distinguish words in the same lexical category. 9 We assume the reader to be familiar with some variation of Earley's algorithm.
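The item machinery above can be sketched as a toy data structure (names are ours; this is not the full dynamic programming interpreter of [15]): an item pairs two modes, and inserting items into Si with duplicate detection is what collapses the exponentially many PDT computations into polynomially many items.

```python
from typing import NamedTuple

class Mode(NamedTuple):
    state: str   # a PDT state p
    top: str     # a stack symbol A
    pos: int     # an input position index i

class Item(NamedTuple):
    current: Mode   # where the simulated computation is now
    back: Mode      # where the current top stack symbol was pushed

def add_item(S: dict, i: int, item: Item) -> bool:
    """Insert item into the item set S[i]; return True iff it is new.
    Merging duplicate items (rather than exploring each PDT
    computation separately) is the dynamic programming step."""
    bucket = S.setdefault(i, set())
    if item in bucket:
        return False
    bucket.add(item)
    return True
```

Since each mode ranges over states, stack symbols, and input positions, at most O(n^2) distinct items exist per input position, which is the source of the cubic bound.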
Earley's original paper uses the word state (from dynamic programming terminology) instead of item.</Paragraph> <Paragraph position="12"> The output grammar is 𝒢 = (S, Π, R, Uf), where S is the set of all items (i.e. the union of the sets Si), and the rules in R are constructed together with their left-hand-side item by the algorithm. The initial nonterminal Uf of 𝒢 derives the last items produced by a successful computation.</Paragraph> <Paragraph position="13"> Appendix A gives the details of the construction of items and rules in 𝒢 by interpretation of the transitions of the PDT. More details may be found in [15,16].</Paragraph> </Section> <Section position="2" start_page="144" end_page="144" type="sub_section"> <SectionTitle> 2.2 The shared forest </SectionTitle> <Paragraph position="0"> An apparently major difference between the above algorithm and other parsers is that it represents a parse as the string of the grammar rules used in a leftmost reduction of the parsed sentence, rather than as a parse tree (cf. section 4). When the sentence has several distinct parses, the set of all possible parse strings is represented in finite shared form by a CF grammar that generates that possibly infinite set. Other published algorithms produce instead a graph structure representing all parse-trees with sharing of common subparts, which corresponds well to the intuitive notion of a shared forest.</Paragraph> <Paragraph position="1"> This difference is only apparent. We show here in section 4 that the CF grammar of all leftmost parses is just a theoretical formalization of the shared-forest graph. Context-Free grammars can be represented by AND-OR graphs that are closely related to the syntax diagrams often used to describe the syntax of programming languages [37], and to the transition networks of Woods [22]. In the case of our grammar of leftmost parses, this AND-OR graph (which is acyclic when there is only finite ambiguity) is precisely the shared-forest graph.
In this graph, AND-nodes correspond to the usual parse-tree nodes, while OR-nodes correspond to ambiguities, i.e. distinct possible subtrees occurring in the same context. Sharing of subtrees is represented by nodes accessed by more than one other node.</Paragraph> <Paragraph position="2"> The grammar viewpoint is the following (cf. the example in section 4). Non-terminal (resp. terminal) symbols correspond to nodes with (resp. without) outgoing arcs. AND-nodes correspond to right-hand sides of grammar rules, and OR-nodes (i.e. ambiguities) correspond to non-terminals defined by several rules. Subtree sharing is represented by several uses of the same symbol in rule right-hand sides.</Paragraph> <Paragraph position="3"> To our knowledge, this representation of parse-forests as grammars is the simplest and most tractable theoretical formalization proposed so far, and the parser presented here is the only one for which the correctness of the output grammar -- i.e. of the shared-forest -- has ever been proved.</Paragraph> <Paragraph position="4"> Though in the examples we use graph(ical) representations for intuitive understanding (grammars are also sometimes represented as graphs [37]), they are not the proper formal tool for manipulating shared forests, and for developing formalized (proved) algorithms that use them. Graph formalization is considerably more complex and awkward to manipulate than the few, specialized, and well-understood concepts of CF grammars.
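The grammar viewpoint can be made concrete with a small hypothetical example (node and rule names below are invented for illustration): a dictionary plays the role of an output grammar whose nonterminals are forest nodes, each right-hand side is an AND-node, a nonterminal with several rules is an OR-node, and enumerating the grammar's language enumerates the leftmost parses.

```python
# Hypothetical shared forest for a 2-way ambiguous sentence:
# N_0_3 covers the whole input and has two alternative analyses.
forest = {
    "N_0_3": [["r1", "N_0_2"], ["r2", "N_1_3"]],   # OR-node: two rules
    "N_0_2": [["r3"]],
    "N_1_3": [["r4"]],
}

def parses(node):
    """Yield every parse string (list of rule names) derivable
    from a forest node of the parse grammar."""
    if node not in forest:            # a terminal, i.e. a rule name
        yield [node]
        return
    for rhs in forest[node]:          # one branch per OR-alternative
        partial = [[]]
        for sym in rhs:               # AND-node: concatenate the sons
            partial = [p + s for p in partial for s in parses(sym)]
        yield from partial
```

Here list(parses("N_0_3")) yields the two leftmost parses [["r1", "r3"], ["r2", "r4"]]; with a cyclic grammar (infinite ambiguity) the same representation stays finite even though the enumeration would not terminate.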
Furthermore, unlike graphs, this grammar formalization of the shared forest may be tractably extended to other grammatical formalisms (cf. section 5).</Paragraph> <Paragraph position="5"> More importantly, our work on the parsing of incomplete sentences [16] has exhibited the fundamental character of our grammatical view of shared forests: when parsing the completely unknown sentence, the shared forest obtained is precisely the complete grammar of the analyzed language.</Paragraph> <Paragraph position="6"> This also leads to connections with the work on partial evaluation [8].</Paragraph> </Section> <Section position="3" start_page="144" end_page="144" type="sub_section"> <SectionTitle> 2.3 The shape of the forest </SectionTitle> <Paragraph position="0"> For our shared-forest, a cubic space complexity (in the worst case -- space complexity is often linear in practice) is achieved, without requiring that the language grammar be in Chomsky Normal Form, by producing a grammar of parses that has at most two symbols on the right-hand side of its rules. This amounts to representing the list of sons of a parse tree node as a Lisp-like list built with binary nodes (see figures 1 and 2), and it allows partial sharing of the sons 10. The structure of the parse grammar, i.e. the shape of the parse forest, is tightly related to the parsing schema used, hence to the structure of the possible computations of the non-deterministic PDT from which the parser is constructed.</Paragraph> <Paragraph position="1"> First we need a precise characterization of parsing strategies, whose distinction is often blurred by superimposed optimizations. We call bottom-up a strategy in which the PDT decides on the nature of a constituent (i.e. on the grammar rule that structures it), after having made this decision first on its subconstituents. It corresponds to a postfix left-to-right walk of the parse tree.
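The two visit orders can be sketched on an ad-hoc (label, sons) tree representation (our own illustration, not the paper's formalism): a bottom-up parser commits to rules in postfix order, a top-down parser in prefix order.

```python
def postfix(tree):
    """Visit sons first, then the node: the order in which a
    bottom-up parser recognizes the constituents."""
    label, sons = tree
    for s in sons:
        yield from postfix(s)
    yield label

def prefix(tree):
    """Visit the node before its sons: the order in which a
    top-down parser recognizes the constituents."""
    label, sons = tree
    yield label
    for s in sons:
        yield from prefix(s)
```

On the tree S(A, B(C)), the postfix walk yields A, C, B, S while the prefix walk yields S, A, B, C.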
Top-Down parsing recognizes a constituent before recognition of its subconstituents, and corresponds to a prefix walk. Intermediate strategies are also possible.</Paragraph> <Paragraph position="2"> The sequence of operations of a bottom-up parser is basically of the following form (up to possible simplifying optimizations): To parse a constituent A, the parser first parses and pushes on the stack each sub-constituent Bi; at some point, it decides that it has all the constituents of A on the stack and it pops them all, and then it pushes A and outputs the (rule number of the) recognized rule A → B1 ... Bn.</Paragraph> <Paragraph position="4"> The repetition of such a sequence results in a shared forest containing parse-trees with the shape described in figure 1, i.e. where each node of the forest points to the beginning of the list of its sons.</Paragraph> <Paragraph position="5"> A top-down PDT uses a different sequence of operations, detailed in appendix B, resulting in the shape of figure 2 where a forest node points to the end of the list of sons, which is itself chained backward. These two figures are only simple examples. Many variations on the shape of parse trees and forests may be obtained by changing the parsing schema.</Paragraph> <Paragraph position="6"> Sharing in the shared forest may correspond to sharing of a complete subtree, but also to sharing of a tail of a list of sons: this is what allows the cubic complexity. Thus bottom-up parsing may share only the rightmost subconstituents of a constituent, while top-down parsing may share only the left-most subconstituents.
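The Lisp-like encoding of son lists mentioned above can be sketched with cons cells (a minimal illustration of ours): because a son list is a chain of binary nodes, two forest nodes can share a common tail of sub-constituents without sharing the whole list, and this partial sharing is what underlies the cubic space bound.

```python
def cons(head, tail):
    """A binary node of the parse grammar: first son + rest of the list."""
    return (head, tail)

def to_list(cell):
    """Flatten a cons chain back into a plain list of sons."""
    out = []
    while cell is not None:
        head, cell = cell
        out.append(head)
    return out

# Two alternative nodes sharing the tail ("B", ("C", None)) of their sons:
shared_tail = cons("B", cons("C", None))
sons1 = cons("A1", shared_tail)   # sons A1 B C
sons2 = cons("A2", shared_tail)   # sons A2 B C, tail physically shared
```

Here sons1 and sons2 differ only in their first cell; the suffix B C is stored once, which mirrors the rightmost-subconstituent sharing of bottom-up parsing (a backward-chained variant would give the leftmost sharing of top-down parsing).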
This relation between parsing schema and shape of the shared forest (and type of sharing) is a consequence of intrinsic properties of chart parsing, and not of our specific implementation.</Paragraph> <Paragraph position="7"> It is for example to be expected that the bidirectional nature of island parsing leads to irregular structure in shared forests when optimal sharing is sought.</Paragraph> </Section> </Section> </Paper>