<?xml version="1.0" standalone="yes"?>
<Paper uid="C88-1075">
  <Title>Parsing Incomplete Sentences</Title>
  <Section position="3" start_page="0" end_page="365" type="metho">
    <SectionTitle>
2 All-Paths Parsing
</SectionTitle>
    <Paragraph position="0"> Since Earley's first paper \[10\], many adaptations or improvements of his ~flgorithm have been published \[6,5,24,28\]. They are usually variations following some chart parsing schema \[16\].</Paragraph>
    <Paragraph position="1"> In a previous paper \[18\], the author attempted to unify all these results by proposing an Earley-like construction for all-paths interpretation of (non-deterministic) Push-Down-Transducers (PDT). The idea was that left-to-right parsing schemata may usually be expressed as a construction technique for building a recognizing Push-Down-Automaton (PDA) from the CF grammar of the language. This is quite apparent when comparing the PDA constructions in \[12\] to the ctmrt sche,nata of \[16\] which are now a widely accepted reference. Thns a construction proposed for general PDTs is de facto applicable to most left-to-right parsing schemata, and allows in particular the use of well established PDT construction teclmiques (e.g. precedence, LL(k), LR(k) \[8,14,2\]) for general CF parsing.</Paragraph>
    <Paragraph position="2"> In this earlier paper, our basic algorithm is proved correct, and its complexity is shown to be O(n3), i.e. as good as the best general parsing algorithms 2. As is usual with Earley's construction 3, the theoretical complexity bound is rarely attained, and the algorithm behaves linearly most of the time.</Paragraph>
    <Paragraph position="3"> Further optimizations are proposed in \[18\] that improve this behavior.</Paragraph>
    <Paragraph position="4"> Most published variants of Earley's algorithm, including Earley's own, may be viewed as (a sometimes weaker form of) our construction applied to some specific PDA or PDT. This is the ~Theoretically faster algorithms \[29,7\] can achieve O(n ~'4~6) but with an unacceptable constant fi~ctor. Note also that we do not require the grammar to be in Chomsky Normal Form.</Paragraph>
    <Paragraph position="5"> SAnd unlike tabular algorithms such as Cocke-Younger-Kasami's \[13,15, 30,11\].</Paragraph>
    <Paragraph position="6">  explicit strategy of Tomita \[28\] in the special case of LALR(1) PDT construction technique. A notable exception is the very general approach of Shell \[25\], though it is very similar to a Datalog extension \[19\] of the algorithm presented here.</Paragraph>
    <Paragraph position="7"> An essential feature of all-paths parsing algorithms is to be able to produce all possible parses in a concise form, with as much sharing as possible of the common subparses. This is realized in many systems \[6,24,28\] by producing some kind of shared-forest which is a representation of all parse-trees with various sharings of common subparts. In the case of our algorithm, a parse is represented by the sequence of rules to be used in a left-to-right reduction of the input sentence to the initial nonterminal of the gramnmr. Sharing between all possible parses is achieved by producing, instead of an extensionally given set of possible parse sequences, a new CF grammar that generates all possible parse sequences (possibly an infinite number if the grammar of the input language is cyclic, and if the parsed sentence is infinitely ambiguous). With appropriate care, it is also possible to read this ontput grammar as a shared-forest (see appendix A). However its meaningful interpretation as a shared-forest is dependent on the parsing schema (cf. \[12,16\]) used in constructing the PDT that produces it as output. Good definition and understanding of shared forests is essential to properly define and handle the extra processing needed to disambiguate a sentence, in the usual case when the ambiguous CF grammar is uscd only as a parsing backbone \[24,26\]. The structure of shared forests is discussed in \[4\].</Paragraph>
    <Paragraph position="8"> Before and while following the next section, we suggest that the reader looks at Appendix A which contains a detailed example showing an output grammar and the corresponding shared forest for a slightly ambiguous input sentence.</Paragraph>
  </Section>
  <Section position="4" start_page="365" end_page="366" type="metho">
    <SectionTitle>
3 The Basic Algorithm
</SectionTitle>
    <Paragraph position="0"> A formal definition of the extended algorithm for possibly incomplete sentences is given in appendix C. The formal aspect of our presentation of the algorithm is justified by the fact that it allows specialization of the given constructions to specific parsing schema without loss of the correctness and complexity properties, as well as the specialization of the optimization techniques (see \[18\]) established in the general case. The examples presented later were obtained with an adaptation of this general algorithm to bottom-up LALR(1) parsers \[8\].</Paragraph>
    <Paragraph position="1"> Our aim is to parse sentences in the language /:(G) generated by a CF phrase structure grammar G = (V,\]E, YI,~) according to its syntax. The notation used is V for the set of nonterminal, \]E for the set of terminals, YI for the rules, and for the initial nonterminal.</Paragraph>
    <Paragraph position="2"> We assume that, by some appropriate parser construction technique (e.g. \[14,8,2,1\]) we mechanically produce from the grammar G a parser for the language PS:(G) in the form of a (possibly non-deterministic) push-down transducer (PDT) TG.</Paragraph>
    <Paragraph position="3"> The output of each possible computation of the parser is a sequence of rules in H a to be used in a left-to-right reduction of the input sentence (this is obviously equivalent to producing a parse-tree).</Paragraph>
    <Paragraph position="4"> We assume for the PDT 7G a very general formal definition that can fit most usual PDT construction techniques. It o o is defined as an 8-tuple T(~ = (q, E, A, II, 6, q, $, F) where: Q is the set of states, \]E is the set of input word symbols, &amp; is the set of stack symbols, II is the set of output symbols (i.e. rules of G), ~l is the initial state, $ is the initial stack symbol, F is the set of final states, 6 is a finite set of transitions of the 4Implementations usually denote these rules by their index in the set II. form: (pAa~--~ qBu) with p, qEq, A,BEZXU{E&amp;}, aE~U{e~),and uEII*.</Paragraph>
    <Paragraph position="5"> Let the PDT be in a configuration p = (p Aa a~ u) where p is the current state, Aa is the stack contents with h on the top, ax is the remaining input where the symbol a is the next to be shifted and x E ~E*, and u is the already produced output. The application of a transition r = (p A a ~ q B v) results in a new configuration p' = (q Ba x uv) where the terminal symbol a has been scanned (i.e. shifted), A has been popped and n has been pushed, and v has been concatenated to the existing output u.</Paragraph>
    <Paragraph position="6"> If the terminal symbol a is replaced by e:~ in the transition, no input symbol is scanned. If A (resp. B) is replaced by e~ then no stack symbol is popped from (resp. pushed on) the stack.</Paragraph>
    <Paragraph position="7"> Our algorithm consists in an Earley-like 5 simulation of the PDT TG. Using the terminology of \[2\], the algorithm builds an item set Si successively for each word symbol xi holding position i in the input sentence x. An item is constituted of two modes of the form (p A i) where p is a PDT state, A is a stack symbol, and i is the index of an input symbol. The item set Si contains items of the form ((p A i) (q B j)) . These items are used as nonterminals of a grammar ~ = (S, II, P, Uf), where 6' is the set of all items (i.e. the union of St), and the rules in are constructed together with their left-hand-side item by the algorithm. The initial nonterminal Uf of ~ derives on the last items produced by a successful computation.</Paragraph>
    <Paragraph position="8"> The meaning of an item U = ((p A i) (q n j)) is the following: * there are computations of the PDT on the given input sentence that reach a configuration pt where the state is p, the stack top is A and the last symbol scanned is xi; * the next stack symbol is then B and, for all these computations, it was last on top in a configuration p where the state was q and the last symbol scanned was xj; * the rule sequences in l-I* derivable from U in the grammar are exactly those sequences output by the above defined comput~:tions of the PDT between configurations p and p~.</Paragraph>
    <Paragraph position="9"> In simpler words, an item may be understood as a set of distinguished fl'agments of the possible PDT computations, that are independent of the initial content of the stack, except for its top element. Item structures are used to share these fragments between all PDT computations that can use them, so as to avoid duplication of work. In the output grammar an item is a nonterminal that may derive on the outputs produced by the corresponding computation fragments.</Paragraph>
    <Paragraph position="10"> The items may also be read as an encoding of the possible configurations that could be attained by the PDT on the given input, with sharing of common stack fragments (the same fragment may be reused several times for the same stack in the case of cyclic grammars, or incomplete sentences). In figure 1 we represent a partial collection of items. Each item is represented by its two modes as (Kh Kh,) without giving the internal structure of modes as a triples (PDT-state x stack-symbol x inputindex). Each mode Kh actually stands for the triple (pa A h ih). We have added arrows from the second component of every item (Kh Kh,) to the first component of any item (Ku Kh,,). This chaining indicates in reverse the order in which the corresponding modes are encountered during a possible computation of the PDT. In particular, the sequence of stack symbols of the first modes of the items in any such chain is a possible stack content. Ignoring the output, an item (Kh K^,) represent the set of PDT configurations where the current state is p~,, the next input symbol to be read has the index ih + 1, and the stack content is formed of all the stack symbols to be found in the first mode of all items of any chain of items beginning with (Kh Kh,).</Paragraph>
    <Paragraph position="11"> Hence, if the collection of items of figure 1 is produced by a dynamic programming computation, it means that a standard non-deterministic computation of the PDT could have reached 5We assume the reader to be familiar with some variation of Earley's algorithm. Earley's original paper uses the word s~ate instead of i~em.  state I)1, having last read the input symbol of index il, and having buitt any of tile following stack configurations (among others), with tim stack top on the left hand side: A1A2As..., A1A2A3A7 . ., A1A2AaAfA6..., A1A2AsAsAs..., A1A2A4AaAbAs . .., A1A2A4AbAs..., and so on.</Paragraph>
    <Paragraph position="12"> The transitions of tlm PDT are interpreted to produce new items, and new associated rules in 5 deg for the output grammar ~, as described in appendix C. When the same item is produced several times, only one copy is kept in the item set, but a new rule is produced each time. This merging of identical items accounts for the sharing of identical subeomputations. The cotresponding rules with stone left-hand-side (i.e. the multiply pro dueed item) account for santo of the sharing in the output (of.</Paragraph>
    <Paragraph position="13"> appendices A &amp; B). Sharing in the output also appears in the use of the :,ame item in the right hand side of sevcral different output rules. This directly results from the non-determinism of the PDT computation, i.e. the ambiguity of the input sentence.</Paragraph>
    <Paragraph position="14"> The critical feature of the algorithm for handling cyclic rules (i.e. infinite ambiguity) is to be found in the handling of papping transitions 6. When applying a popping transition r = (p A eI:i ~ r e~. z) to the item C = ((p A i) (q la j)) the algarithm mu,*t find all items Y = ((q, j)(s D k)), i.e. all items with first mode (q B j), produced and build for each of then, a new itera V = ((r Jl i) (s D k)) together with the output rule (V-~ YUz) to be added to 70. The subtle point is that the Y-items must be all items with (q B j) as first mode, including those that, when j = i, may be built later in the computation (e.g. because their existence depends on some other V-item built in that step).</Paragraph>
  </Section>
  <Section position="5" start_page="366" end_page="366" type="metho">
    <SectionTitle>
4 Parsing Incomplete Sentences
</SectionTitle>
    <Paragraph position="0"> In order to handle incomplete sentences, we extend the input vocabulary with 2 symbols: &amp;quot;?&amp;quot; standing for one unknown word symbol, and &amp;quot;*&amp;quot; standing for an unknown sequence of input word symbols ~.</Paragraph>
    <Paragraph position="1"> Normally a scanning transition, say (p e a ~ r e z), is applicable to ~tx~ item, say U = ((p A i) (q B j)) in ,-qi, only when a == xi+l, wlmre xi+, is the next input symbol to be shifted. It produces a ,law item in 5:1+1 and a new rule in 7 deg, respectively V ~-: ((rA i+l)(qllj)) and (V-+ Uz) for the above transition and item.</Paragraph>
    <Paragraph position="2"> When the next input symbol to be shifted is xi+l = ? (i.e. the unknown input word symbol), then any scanning transition may 6Popping transitions are also the critical place to look at for ensuring O(n a) worst ease complexity.</Paragraph>
    <Paragraph position="3"> 7Several adjacent &amp;quot;*&amp;quot; are equivalent to a single one. be applied as above independently of the input symbol required by the transition (provided that the transition is applicable with respect to PDT state and stack symbol).</Paragraph>
    <Paragraph position="4"> When the next input symbol to be shifted is x~+l = * (i.e. the unknowlt input subsequence), then the algorithm proceeds as for the unknown word, except that the new item V is created in item set 8~ instead of b'i+l, i.e. V = ((r A i) (q B j)) in the case of the abow; example. Thus, in the presence of the unknown symbol subsequence *, scanning transitions may be applied any number of times to the same computation thread, without shifting the input stream s .</Paragraph>
    <Paragraph position="5"> Scanning transitions are also used normally on input symbol xi+2 so as to produce also itetns in ,S~+:, for example the item ((r A i+2) (q B j)), assuming a =-- xi+~ in the case of the above example 9. This is how computation proceeds beyond the ltllknown subscquenee.</Paragraph>
    <Paragraph position="6"> There is a remaining difficulty due to tile fact that it may be hard to relate a parse sequence of rules in II to the input sentence because of the unknown nmnber of input symbol actually assumed for all occm'rence of the unknown input subsequence.</Paragraph>
    <Paragraph position="7"> We solve this difficulty by including tile input word symbols in their propel&amp;quot; place in parse sequences, which can thus be read as postfix polish encodings of tile parse tree. In such a parse sequence, the symbol * is included a number of times equal to the assumed length of the corresponding unknown input subsequcnce(s) for that parse (cf. appendix B).</Paragraph>
    <Paragraph position="8"> A last point concerns simplification of the resulting grammar (~, or equivalently of the corresponding shared-parse-forest. In practice an unknown subseque, nce may stand for an arbitrarily complex sequence of input word symbols, with a col rcspondingly complex pars(&amp;quot; structure. Since the subsequence is unknown anyway, its hypothetical structures (:all be summarized by the nonterminal symbols that dominate it (thanks to context-fl'eeness).</Paragraph>
    <Paragraph position="9"> Hence the output parse grammar ~ produced by our algorithm may be simplified by replacing with the unknown subsequence terminal *, all nonterminals (i.e. items) that deri,e only on (occurrences of) this symbol. However, to keep the output readable, wc usually qualify these * symbols with the appropriate nonterminal of tile parsed language grammar G. The substructures thus eliminated can be retrieved by arbitrary l~e of the original CF grammar of the parsed language, whici~ thus complements the simplified output gramma.P deg. An example i,~; given in appendix B.</Paragraph>
  </Section>
class="xml-element"></Paper>