<?xml version="1.0" standalone="yes"?>
<Paper uid="E93-1045">
  <Title>The Use of Shared Forests in Tree Adjoining Grammar Parsing*</Title>
  <Section position="4" start_page="0" end_page="385" type="metho">
    <SectionTitle>
2 Tree Adjoining Grammars
</SectionTitle>
    <Paragraph position="0"> Tag is a tree-generating formalism introduced in [Joshi et al., 1975]. A tag is defined by a finite set of elementary trees that are composed by means of the operations of tree adjunction and substitution.</Paragraph>
    <Paragraph position="1"> In this paper, we only consider the use of the adjunction operation.</Paragraph>
    <Paragraph position="2"> A tag is denoted by $G = (V_N, V_T, S, I, A)$ where $V_N$ is a finite set of nonterminal symbols, $V_T$ is a finite set of terminal symbols, $S \in V_N$ is the start symbol, $I$ is a finite set of initial trees, and $A$ is a finite set of auxiliary trees.</Paragraph>
    <Paragraph position="3"> An initial tree is a tree with root labeled by $S$ and internal nodes and leaf nodes labeled by nonterminal and terminal symbols, respectively. An auxiliary tree is a tree that has a leaf node (the foot node) that is labeled by the same nonterminal that labels the root node. The remaining leaf nodes are labeled by terminal symbols and all internal nodes are labeled by nonterminals. The path from the root node to the foot node of an auxiliary tree is called the spine of the auxiliary tree. An elementary tree is either an initial tree or an auxiliary tree. We use $\alpha$ to refer to initial trees and $\beta$ for auxiliary trees.</Paragraph>
    <Paragraph position="4"> A node of an elementary tree is called an elementary node and is named with an elementary node address. An elementary node address is a pair comprising the name of the elementary tree to which the node belongs and the address of the node within that tree. We will assume the standard addressing scheme: the root node has the address $\epsilon$; if a node with address $\mu$ has $k$ children then the $k$ children (in left to right order) have addresses $\mu \cdot 1, \ldots, \mu \cdot k$. Thus, for each address $\mu$ we have $\mu \in \mathcal{N}^*$ where $\mathcal{N}$ is the set of natural numbers. In this section we use $\mu$ to refer to addresses and $\eta$ to refer to elementary node addresses. In general, we can write $\eta = \langle \gamma, \mu \rangle$ where $\gamma$ is an elementary tree, $\mu \in \mathit{dom}(\gamma)$, and $\mathit{dom}(\gamma)$ is the set of addresses of the nodes in $\gamma$.</Paragraph>
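    <Paragraph> The addressing scheme can be illustrated with a short Python sketch. The encoding of a tree as a pair (label, list of children), the use of tuples of integers for addresses, and the example tree are assumptions made for illustration only; the names dom and lbl simply mirror the notation of this section.

```python
# A tree is encoded (by assumption) as (label, [children]).
# An address is a tuple of child indices; the root has the empty tuple.

def dom(tree, address=()):
    """Yield every node address of `tree` in top-down, left-to-right order."""
    yield address
    _label, children = tree
    for k, child in enumerate(children, start=1):
        # The k-th child of the node at `address` has address `address . k`.
        yield from dom(child, address + (k,))


def lbl(tree, address):
    """Return the label of the node found at `address` in `tree`."""
    label, children = tree
    for k in address:
        label, children = children[k - 1]
    return label


# Hypothetical example tree: S(a, S(b, c)).
example = ("S", [("a", []), ("S", [("b", []), ("c", [])])])
addresses = list(dom(example))
```

    Here the second child of the root has address (2,) and its first child has address (2, 1), which corresponds to $2 \cdot 1$ in the notation above.</Paragraph>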
    <Paragraph position="5"> Let $\gamma$ be a tree with an internal node labeled by a nonterminal $A$. Let $\beta$ be an auxiliary tree with root and foot node labeled by the same nonterminal $A$.</Paragraph>
    <Paragraph position="6"> The tree, $\gamma'$, that results from the adjunction of $\beta$ at the node in $\gamma$ labeled $A$ is formed by removing the subtree of $\gamma$ rooted at this node, inserting $\beta$ in its place, and substituting the removed subtree at the foot node of $\beta$. Each elementary node is associated with a selective adjoining (SA) constraint that determines the set of auxiliary trees that can be adjoined at that node. In addition, when adjunction is mandatory at a node it is said to have an obligatory adjoining (OA) constraint. Whether $\beta$ can be adjoined at the node (labeled by $A$) in $\gamma$ is determined by the SA constraint of the node. In $\gamma'$ the nodes contributed by $\beta$ have the same constraints as those associated with the corresponding nodes in $\beta$. The remaining nodes in $\gamma'$ have the constraints of the corresponding nodes in $\gamma$.</Paragraph>
    <Paragraph position="7"> Given $\mu \in \mathit{dom}(\gamma)$, by $\mathit{lbl}(\gamma, \mu)$ we refer to the label of the node addressed $\mu$ in $\gamma$. Similarly, we will use $\mathit{sa}(\gamma, \mu)$ and $\mathit{oa}(\gamma, \mu)$ to refer to the SA and OA constraints of a node addressed $\mu$ in a tree $\gamma$. Finally, we will use $\mathit{ft}(\beta)$ to refer to the address of the foot node of an auxiliary tree $\beta$.</Paragraph>
    <Paragraph position="8"> $\mathit{adj}(\gamma, \mu, \beta)$ denotes the tree that results from the adjunction of $\beta$ at the node in $\gamma$ with address $\mu$. This is defined when $\beta \in \mathit{sa}(\gamma, \mu)$. If $\mathit{adj}(\gamma, \mu, \beta) = \gamma'$ then the nodes in $\gamma'$ are defined as follows.</Paragraph>
    <Paragraph position="10"> not equal to or dominated by the node addressed $\mu$ in $\gamma$) then</Paragraph>
    <Paragraph position="12"> - $\mathit{lbl}(\gamma', \mu \cdot \mathit{ft}(\beta) \cdot \mu_1) = \mathit{lbl}(\gamma, \mu \cdot \mu_1)$, - $\mathit{sa}(\gamma', \mu \cdot \mathit{ft}(\beta) \cdot \mu_1) = \mathit{sa}(\gamma, \mu \cdot \mu_1)$, - $\mathit{oa}(\gamma', \mu \cdot \mathit{ft}(\beta) \cdot \mu_1) = \mathit{oa}(\gamma, \mu \cdot \mu_1)$.  In general, if $\mu$ is the address of a node in $\gamma$ then $\langle \gamma, \mu \rangle$ denotes the elementary node address of the node that contributes to its presence, and hence its label and constraints.</Paragraph>
    <Paragraph position="13"> The tree language, T(G), generated by a TAG, G, is the set of trees derived starting from an initial tree such that no node in the resulting tree has an OA constraint. The (string) language, L(G), generated by a TAG, G, is the set of strings that appear on the frontier of trees in T(G).</Paragraph>
    <Paragraph position="14"> Example 2.1 Figure 1 gives a TAG, $G$, which generates the language $\{wcw \mid w \in \{a,b\}^*\}$. The constraints associated with the root and foot of $\beta$ specify that no auxiliary trees can be adjoined at these nodes. This is indicated in Figure 1 by associating the empty set, $\emptyset$, with these nodes. An example derivation of the strings aca and abcab is shown in Figure 2.</Paragraph>
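    <Paragraph> The adjunction operation of this section can be sketched in Python as follows. This is only an illustration under stated assumptions: trees are encoded as (label, list of children), addresses are tuples of child indices, SA and OA constraints are ignored, and the three elementary trees are hypothetical stand-ins for a grammar of the copy language, since Figure 1 is not reproduced here.

```python
def subtree(tree, address):
    """Return the subtree of `tree` rooted at `address`."""
    for k in address:
        tree = tree[1][k - 1]
    return tree


def replace(tree, address, new):
    """Return a copy of `tree` with the subtree at `address` replaced by `new`."""
    if not address:
        return new
    label, children = tree
    children = list(children)
    k = address[0]
    children[k - 1] = replace(children[k - 1], address[1:], new)
    return (label, children)


def adj(gamma, mu, beta, foot):
    """Adjoin `beta` (whose foot node is at address `foot`) at `mu` in `gamma`."""
    excised = subtree(gamma, mu)
    # Root and foot of beta must carry the label of the adjunction node.
    assert excised[0] == beta[0] == subtree(beta, foot)[0]
    # The excised subtree is attached at the foot node of beta,
    # and the result takes the place of the excised subtree in gamma.
    return replace(gamma, mu, replace(beta, foot, excised))


def frontier(tree):
    """Concatenate the leaf labels of `tree` from left to right."""
    label, children = tree
    return label if not children else "".join(frontier(c) for c in children)


# Hypothetical elementary trees (assumed, not taken from Figure 1).
alpha = ("S", [("c", [])])
beta_a = ("S", [("a", []), ("S", [("S", []), ("a", [])])])
beta_b = ("S", [("b", []), ("S", [("S", []), ("b", [])])])
FOOT = (2, 1)  # foot address shared by both assumed auxiliary trees

t1 = adj(alpha, (), beta_a, FOOT)   # frontier: aca
t2 = adj(t1, (2,), beta_b, FOOT)    # frontier: abcab
```

    The second adjunction takes place at the inner S node contributed by the first auxiliary tree, which is how the two halves of the copied string grow in step.</Paragraph>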
    <Paragraph position="16"/>
  </Section>
  <Section position="5" start_page="385" end_page="386" type="metho">
    <SectionTitle>
3 Linear Indexed Grammars
</SectionTitle>
    <Paragraph position="0"> An indexed grammar [Aho, 1968] can be viewed as a cfg in which objects are nonterminals with an associated stack of symbols. In addition to rewriting nonterminals, the rules of the grammar can have the effect of pushing or popping symbols on top of the stacks that are associated with each nonterminal.</Paragraph>
    <Paragraph position="1"> In [Gazdar, 1988] a restricted form of indexed grammars was discussed in which the stack associated with the nonterminal on the left of each production can only be associated with one of the occurrences of nonterminals on the right of the production. Stacks of bounded size are associated with other occurrences of nonterminals on the right of the production. We call these linear indexed grammars (lig). Lig generate the same class of languages as tag [Vijay-Shanker and Weir, in press a].</Paragraph>
    <Paragraph position="3"> A lig is denoted by $G = (V_N, V_T, V_I, S, P)$ where $V_N$ is a finite set of nonterminals, $V_T$ is a finite set of terminals, $V_I$ is a finite set of indices (stack symbols), $S \in V_N$ is the start symbol, and $P$ is a finite set of productions.</Paragraph>
    <Paragraph position="4"> Given a lig, $G = (V_N, V_T, V_I, S, P)$, we define the set of objects of $G$ as $V_C(G) = \{ A[\alpha] \mid A \in V_N, \alpha \in V_I^* \}$.</Paragraph>
    <Paragraph position="6"> We use $A[\circ\circ\,\alpha]$ to denote the nonterminal $A$ associated with an arbitrary stack with the string $\alpha$ on top and $A[\,]$ to denote that an empty stack is associated with $A$. We use $\Upsilon$ to denote strings in $(V_C(G) \cup V_T)^*$.</Paragraph>
    <Paragraph position="7"> The general form of a lig production is $A[\circ\circ\,\alpha] \rightarrow \Upsilon B[\circ\circ\,\alpha'] \Upsilon'$ where $A, B \in V_N$, $\alpha, \alpha' \in V_I^*$ and $\Upsilon, \Upsilon' \in (V_C(G) \cup V_T)^*$. Given a grammar, $G = (V_N, V_T, V_I, S, P)$, the derivation relation, $\Rightarrow_G$, is defined such that if $A[\circ\circ\,\alpha] \rightarrow \Upsilon B[\circ\circ\,\alpha'] \Upsilon' \in P$ then for every $\beta \in V_I^*$ and $\Upsilon_1, \Upsilon_2 \in (V_C(G) \cup V_T)^*$: $\Upsilon_1 A[\beta\alpha] \Upsilon_2 \Rightarrow_G \Upsilon_1 \Upsilon B[\beta\alpha'] \Upsilon' \Upsilon_2$.</Paragraph>
    <Paragraph position="9"> As a result of the linearity in the rules, the stack $\beta\alpha$ associated with the object in the left-hand side of the derivation and the stack $\beta\alpha'$ associated with one of the objects in the right-hand side have the initial part $\beta$ in common. In the derivation above, we say that the object $B[\beta\alpha']$ is the distinguished child of $A[\beta\alpha]$. Given a derivation, the distinguished descendant relation is the reflexive, transitive closure of the distinguished child relation.</Paragraph>
    <Paragraph position="10"> The language generated by a lig, $G$, is $L(G) = \{ w \in V_T^* \mid S[\,] \Rightarrow^*_G w \}$, where $\Rightarrow^*_G$ denotes the reflexive, transitive closure of $\Rightarrow_G$.</Paragraph>
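    <Paragraph> A single step of the lig derivation relation can be sketched in Python as below. The class names Obj and Production, the function step, and the small grammar for strings of the form a^n b^n c^n are all hypothetical and chosen for illustration. The sketch also assumes a terminating kind of production with no distinguished child (used here to erase B with an empty stack), which is an addition to the general form given above.

```python
from typing import NamedTuple, Optional, Tuple


class Obj(NamedTuple):
    nonterminal: str
    stack: Tuple[str, ...]     # bottom of the stack first, top last


class Production(NamedTuple):
    lhs: str                   # A
    pop: Tuple[str, ...]       # alpha, required on top of the stack of A
    left: tuple                # Upsilon
    child: Optional[str]       # B, the distinguished child (None if absent)
    push: Tuple[str, ...]      # alpha', which replaces alpha
    right: tuple               # Upsilon'


def step(form, index, prod):
    """Rewrite the object at position `index` of the sentential form."""
    obj = form[index]
    n = len(prod.pop)
    assert obj.nonterminal == prod.lhs
    assert n == 0 or obj.stack[-n:] == prod.pop
    beta = obj.stack[:len(obj.stack) - n]   # part shared with the child
    if prod.child is None:
        assert beta == ()                   # assumed: only with a fixed stack
        middle = ()
    else:
        middle = (Obj(prod.child, beta + prod.push),)
    return form[:index] + prod.left + middle + prod.right + form[index + 1:]


# Hypothetical grammar: S pushes one index per a...c pair, B pops one per b.
push = Production("S", (), ("a",), "S", ("i",), ("c",))
switch = Production("S", (), (), "B", (), ())
pop = Production("B", ("i",), ("b",), "B", (), ())
stop = Production("B", (), (), None, (), ())

form = (Obj("S", ()),)
for index, prod in [(0, push), (1, push), (2, switch),
                    (2, pop), (3, pop), (4, stop)]:
    form = step(form, index, prod)
```

    In every step the part of the stack named beta in the code is passed unchanged to the distinguished child, which is the linearity property discussed above.</Paragraph>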
  </Section>
  <Section position="6" start_page="386" end_page="386" type="metho">
    <SectionTitle>
4 Parsing as Intersection with
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="386" end_page="386" type="sub_section">
      <SectionTitle>
Regular Languages
</SectionTitle>
      <Paragraph position="0"> In the case of cfg parsing, [Billot and Lang, 1989; Lang, 1992] show that a cfg can be used to encode all of the parses for a given string. For example, let $G_0$ be a grammar and let the string $w = a_1 \ldots a_n$ be in $L(G_0)$. All parses for the string $w$ can be represented by the shared forest grammar $G_w$. The nonterminals in $G_w$ are of the form $(A, i, j)$ where $A$ is a nonterminal of $G_0$ and $0 \le i \le j \le n$. The construction of $G_w$ is such that any derivation from $(A, i, j)$ encodes a derivation $A \Rightarrow^*_{G_0} a_{i+1} \ldots a_j$. For instance, suppose $A \rightarrow BC$ is a production in $G_0$ that is used in the first step of a derivation of the substring $a_{i+1} \ldots a_j$ from $A$. Corresponding to this production, $G_w$ contains a production $(A, i, j) \rightarrow (B, i, k)(C, k, j)$ for each $0 \le i \le k \le j \le n$. This can be used to encode all parses of $a_{i+1} \ldots a_j$ from $A$ where $B \Rightarrow^* a_{i+1} \ldots a_k$ and $C \Rightarrow^* a_{k+1} \ldots a_j$. In general, corresponding to a production $A \rightarrow X_1 \ldots X_r$ in $G_0$ the grammar $G_w$ contains a production $(A, i_1, j_r) \rightarrow (X_1, i_1, j_1) \ldots (X_r, i_r, j_r)$ for every $i_1, j_1, \ldots, i_r, j_r \in \{1, \ldots, n\}$ such that for each $1 \le k \le r$ if $X_k \in V_T$ then $i_k + 1 = j_k$, otherwise $i_k + 1 \le j_k$. Additionally, $G_w$ includes the production</Paragraph>
      <Paragraph position="2"> Note that the number of nonterminals in the shared forest grammar, $G_w$, is $O(n^2)$ and the number of productions is $O(n^{m+1})$ where $|w| = n$ and $m$ is the maximum number of nonterminals in the right-hand side of a production in $G_0$. Therefore, if the object grammar were in Chomsky normal form, the number of productions would be $O(n^3)$.</Paragraph>
      <Paragraph position="3"> Lang [1992] extended this by showing that parsing a string $w$ according to a grammar $G$ can be viewed as intersecting the language $L(G)$ with the regular language $\{w\}$. Suppose we have an object context-free grammar $G_0$ and some deterministic finite state automaton $M$. For the sake of simplicity, let us assume that $G_0$ is in Chomsky normal form. The standard proof that context-free languages are closed under intersection with regular languages constructs a context-free grammar for $L(G_0) \cap L(M)$ with a production $(A, p, q) \rightarrow (B, p, r)(C, r, q)$ for each production $A \rightarrow BC$ of $G_0$ and states $p, q, r$ of $M$. Also for each terminal $a$ the production $(a, p, q) \rightarrow a$ will be included if and only if $\delta(p, a) = q$ where $\delta$ is the transition function of $M$.</Paragraph>
      <Paragraph position="4"> Lang [1992] applied this to cfg recognition as follows. Given an input, $w = a_1 \ldots a_n$, define the dfa $M_w$ such that $L(M_w) = \{w\}$. The state set of $M_w$ is $\{0, 1, \ldots, n\}$; the transition function $\delta$ is such that $\delta(i, a_{i+1}) = i + 1$ for each $0 \le i &lt; n$; 0 is the initial state; and $n$ is the final state. The shared forest grammar $G_w$ is obtained when the standard intersection construction described above is applied to $G_0$ and $M_w$. Furthermore, since $L(G_w) = L(G_0) \cap L(M_w)$ and $L(M_w) = \{w\}$, we have $w \in L(G_0)$ if and only if $L(G_w)$ is not the empty set. That is, the original recognition problem can be turned into one of generating the shared forest grammar, $G_w$, and deciding whether the start nonterminal, $(S, 0, n)$, of $G_w$ is a useful symbol, i.e., whether there is some terminal string $z$ such that $(S, 0, n) \Rightarrow^*_{G_w} z$.</Paragraph>
      <Paragraph position="6"> Here $S$ has been taken to be the start nonterminal of $G_0$. Note that $G_w$ can be constructed in $O(n^3)$ time and "recognition" can also be accomplished within this time bound.</Paragraph>
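      <Paragraph> The recognition procedure described above can be sketched in Python for a grammar in Chomsky normal form. The function names, the encoding of productions as tuples, and the example grammar are assumptions made for illustration. Two simplifications are also assumed: the terminal step $\delta(i, a_{i+1}) = i + 1$ is folded directly into the lexical productions, and only non-empty spans are generated, since a grammar in Chomsky normal form derives no empty substrings. The fixpoint loop is written for clarity and is not the linear-time procedure needed for the stated bound.

```python
def shared_forest(binary, lexical, w):
    """Productions of G_w for a grammar in Chomsky normal form.

    binary:  set of (A, B, C) for productions A -> B C
    lexical: set of (A, a)    for productions A -> a
    """
    n = len(w)
    prods = []
    # (A, i, j) -> (B, i, k) (C, k, j) with k strictly between i and j.
    for (A, B, C) in binary:
        for i in range(n + 1):
            for k in range(i + 1, n + 1):
                for j in range(k + 1, n + 1):
                    prods.append(((A, i, j), ((B, i, k), (C, k, j))))
    # (A, i, i+1) -> a_{i+1} whenever A -> a_{i+1} is a production of G_0.
    for i, a in enumerate(w):
        for (A, b) in lexical:
            if a == b:
                prods.append(((A, i, i + 1), (a,)))
    return prods


def productive(prods):
    """Nonterminals of G_w that derive some terminal string."""
    done = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in prods:
            # Terminals are strings; nonterminals of G_w are triples.
            if lhs not in done and all(
                    isinstance(x, str) or x in done for x in rhs):
                done.add(lhs)
                changed = True
    return done


def recognize(binary, lexical, start, w):
    """True if (start, 0, n) is a useful symbol of G_w."""
    return (start, 0, len(w)) in productive(shared_forest(binary, lexical, w))


# Hypothetical object grammar: S -> S S | a.
binary = {("S", "S", "S")}
lexical = {("S", "a")}
accepted = recognize(binary, lexical, "S", "aaa")
rejected = recognize(binary, lexical, "S", "ab")
```

      Keeping only the productions whose symbols are all productive would give a pruned grammar from which individual parses can be read off.</Paragraph>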
      <Paragraph position="7"> One advantage that arises from viewing parsing as intersection with regular languages is that exactly the same algorithm can be given a word net (a regular language that is not a singleton) rather than a single word as input. This could be useful if we wish to deal with ill-formed inputs.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="386" end_page="389" type="metho">
    <SectionTitle>
5 Derivation versus Derived Trees in
TAG
</SectionTitle>
    <Paragraph position="0"> For grammar formalisms involving the derivation of trees, a tree is called a derived tree with respect to a given grammar if it can be derived using the rewriting rules of the grammar. A derivation tree of the grammar, on the other hand, is a tree that encodes the sequence of rewritings used in deriving a derived tree. In the case of cfg, a tree that is derived contains all the information about its derivation and there is no need to distinguish between derivation trees and derived trees. This is not always the case. In particular, for a tree-rewriting system like tag we need to distinguish between derived and derivation trees.</Paragraph>
    <Paragraph position="1"> In fact there are at least two ways one can encode tag derivation trees. The first (see [Vijay-Shanker, 1987]) captures the fact that derivations in tag are context-free, i.e., the trees that can be adjoined at a node can be determined a priori and are not dependent on the derivation history. We capture this context-freeness by giving a cfg to represent the set of all possible derivation sequences in a tag. An alternate scheme uses a tag or a lig (see [Vijay-Shanker and Weir, in press b]) to represent the set of all possible derivations.</Paragraph>
    <Paragraph position="2"> We briefly consider the first scheme to show how, given a tag, $G_0$, and a string, $w$, a context-free grammar can be used to represent shared forests. In later sections we will study the second scheme using lig for shared forests.</Paragraph>
    <Paragraph position="3"> Given a tag $G_0$ and a string $w = a_1 \ldots a_n$ we construct a context-free grammar, $G_w$, such that $L(G_w) \ne \emptyset$ if and only if $w \in L(G_0)$. Let $M_w$ be the dfa for $w$ described in Section 4.</Paragraph>
    <Paragraph position="4"> Consider a tree $\beta$ that has been derived from some auxiliary tree in $A$. Let the string on the frontier of $\beta$ that is to the left of the foot node be $u_l$ and the string to the right of the foot node be $u_r$. Consider the tree that results from the adjunction of $\beta$ at a node with elementary node address $\eta$ (see footnote 1), where $v$ is the string on the frontier of the subtree rooted at $\eta$. After adjunction the strings $u_l$ and $u_r$ will appear to the left and right (respectively) of $v$.</Paragraph>
    <Paragraph position="5"> Suppose that in a derivation of the string $w$ by the grammar $G_0$ the strings $u_l$ and $u_r$ form two contiguous substrings of $w$, i.e., $u_l = a_{i+1} \ldots a_p$ and $u_r = a_{q+1} \ldots a_j$ for some $0 \le i \le p \le q \le j \le n$.</Paragraph>
    <Paragraph position="6"> Thus, according to the definition of $M_w$ we would have $\delta(i, u_l) = p$ and $\delta(q, u_r) = j$. Hence, we can use the four states $i$, $j$, $p$ and $q$ of $M_w$ to account for which parts of $w$ are spanned by the frontier of $\beta$.</Paragraph>
    <Paragraph position="7"> Since the string appearing at the subtree rooted at $\eta$ is $v$, if $\delta(p, v) = q$ we have $\delta(i, u_l v u_r) = j$, and $p$ and $q$ identify the substring of $w$ that is spanned by the subtree rooted at $\eta$. However, the node $\eta$ may be on the spine of some auxiliary tree, i.e., on the path from the root to the foot node. In that case we will have to view the frontier of the subtree rooted at $\eta$ as comprising two substrings, say $v_l$ and $v_r$, to the left and right of the foot node, respectively. The two states $p$, $q$ of $M_w$ then do not fully characterize the frontier of the subtree rooted at $\eta$. We need four states, say $p, q, r, s$, where $\delta(p, v_l) = r$ and $\delta(s, v_r) = q$. Note that the four states in question only characterize the frontier of the subtree rooted at $\eta$ before the adjunction of $\beta$ takes place. The four states $i, j, r, s$ characterize the situation after adjunction of $\beta$ since $\delta(i, u_l) = p$, $\delta(p, v_l) = r$, $\delta(s, v_r) = q$ and $\delta(q, u_r) = j$.</Paragraph>
    <Paragraph position="9"> Footnote 1: Rather than repeatedly saying a node with an elementary node address $\eta$, henceforth we simply refer to it as the node $\eta$.</Paragraph>
    <Paragraph position="10"> In the shared forest cfg $G_w$ the derivation of the string at the frontier of the tree rooted at $\eta$ before adjunction will be captured by the use of a nonterminal of the form $(\bot, \eta, p, q, r, s)$ and the situation after adjunction will be characterized by $(\top, \eta, i, j, r, s)$. We use the symbols $\top$ and $\bot$ to capture the fact that consideration of a node involves two phases: (i) the $\top$ part where we consider adjunction at a node, and (ii) the $\bot$ part where we consider the subtree rooted at this node. Note that the states $r, s$ are only needed when $\eta$ is a node on the spine of an auxiliary tree.</Paragraph>
    <Paragraph position="11"> When this is not the case we let r = s = -.</Paragraph>
    <Paragraph position="12"> Since we have characterized the frontier of $\beta$ (i.e., the subtree rooted at $\mathit{root}_\beta$, the root of $\beta$) by the four states $i, j, p, q$, we can use the nonterminal $(\top, \mathit{root}_\beta, i, j, p, q)$ and can capture the derivation involving adjunction of $\beta$ at $\eta$ by a production of the form $(\top, \eta, i, j, r, s) \rightarrow (\top, \mathit{root}_\beta, i, j, p, q)(\bot, \eta, p, q, r, s)$. Without further discussion, we will give the productions of $G_w$. For each elementary node $\eta$ do the following.</Paragraph>
    <Paragraph position="13"> Case 1: When $\eta$ is a node that is labeled by a terminal $a$, add the production $(\top, \eta, p, q, -, -) \rightarrow a$ if and only if $\delta(p, a) = q$.</Paragraph>
    <Paragraph position="14"> Case 2a: Let $\eta_1$ and $\eta_2$ be the children of $\eta$. If the left child $\eta_1$ dominates the foot node then add the production $(\bot, \eta, i, j, p, q) \rightarrow (\top, \eta_1, i, k, p, q)(\top, \eta_2, k, j, -, -)$; if neither child dominates the foot node then add the production $(\bot, \eta, i, j, -, -) \rightarrow (\top, \eta_1, i, k, -, -)(\top, \eta_2, k, j, -, -)$.
    Case 2b: Let $\eta_1$ and $\eta_2$ be the children of $\eta$. If the right child $\eta_2$ dominates the foot node then add the production $(\bot, \eta, i, j, p, q) \rightarrow (\top, \eta_1, i, k, -, -)(\top, \eta_2, k, j, p, q)$.
    Case 3: When $\eta$ is a nonterminal node that does not have an OA constraint, then to capture the fact that it is not necessary to adjoin at this node, we add $(\top, \eta, i, j, p, q) \rightarrow (\bot, \eta, i, j, p, q)$.
    Case 4a: When $\eta$ is a node where $\beta$ can be adjoined and $\mathit{root}_\beta$ is the root node of $\beta$, add the production $(\top, \eta, i, j, r, s) \rightarrow (\top, \mathit{root}_\beta, i, j, p, q)(\bot, \eta, p, q, r, s)$.
    If $\eta$ is the root of an initial tree then add the production $S \rightarrow (\top, \eta, 0, n, -, -)$</Paragraph>
    <Paragraph position="15"> where $S$ is the start symbol of $G_w$.</Paragraph>
    <Paragraph position="16"> Note that (cases 2a and 2b) we are assuming binary branching merely to simplify the presentation. We can use a sequence of binary cfg productions to encode situations where $\eta$ has more than two children. That is, even if the object-level grammar were not binary branching, the shared forest grammar can still be.</Paragraph>
    <Paragraph position="17"> Note that since the state set of $M_w$ is $\{0, \ldots, n\}$, the number of nonterminals in $G_w$ is $O(n^4)$. Since there are at most three nonterminals in a production, there are at most six states involved in a production. Therefore, the number of productions is $O(n^6)$ and construction of this grammar takes $O(n^6)$ time. Although the derivations of $G_w$ encode derivations of the string $w$ by $G_0$, the specific set of terminal strings that is generated by $G_w$ is not important. We do however have $L(G_w) \ne \emptyset$ if and only if $w \in L(G_0)$. As before, we can determine whether $L(G_w) \ne \emptyset$ by checking whether the start nonterminal $S$ is useful. Furthermore this can be detected in time and space linear in the size of the grammar. Since $w \in L(G_0)$ if and only if $L(G_w) \ne \emptyset$, recognition can be done in $O(n^6)$ time and space.</Paragraph>
    <Paragraph position="18"> Once we have found all the useful symbols in the grammar we can prune the grammar by retaining only those productions that have only useful symbols. Since $G_w$ is a cfg and since we can now guarantee that every nonterminal can derive a terminal string, and therefore that using any production will yield a terminal string eventually, the derivations of $w$ in $G_0$ can be read off by simply reading off derivations in $G_w$.</Paragraph>
    <Paragraph position="19"> 7 Using LIG for Shared Forests. We now present an alternate scheme to represent the derivations of a string $w$ from a given object tag grammar $G_0$. In later sections we show how it can be used for solving the recognition problem and how a single parse can be extracted.</Paragraph>
    <Paragraph position="20"> The scheme presented in Section 6 that produced a cfg shared forest grammar captured the context-freeness of tag derivations. The approach that we now consider captures an alternative view of tag derivations in which a derivation is viewed as sensitive to the derivation history. In particular, the control of derivation can be captured with the use of additional stack machinery. This underlies the use of lig to represent the shared forests.</Paragraph>
    <Paragraph position="21"> In order to understand how a lig can be used to encode a tag derivation, consider a top-down derivation in the object grammar as follows. A tag derivation can be seen as a traversal over the elementary trees beginning at the root of one of the initial trees. Suppose we have reached some elementary node $\eta$. We must first consider adjunction at $\eta$ and after that we must visit each of $\eta$'s subtrees from left to right. When we first reach $\eta$ we say that we are in the top phase of $\eta$. The derivation lig encodes this with the nonterminal $\top$ associated with a stack whose top element is $\eta$. After having considered adjunction at $\eta$ we are in the bottom phase of $\eta$. The derivation lig encodes this with the nonterminal $\bot$ associated with a stack whose top element is $\eta$.</Paragraph>
    <Paragraph position="22"> When considering adjunction at $\eta$ we may have a choice of either not adjoining at all or selecting some auxiliary tree to adjoin. In the former case we move directly to the bottom phase of $\eta$. In the latter case we move to (visit) the root of the auxiliary tree $\beta$ that we have chosen to adjoin. Once we have finished visiting the nodes of $\beta$ (i.e., we have reached the foot of $\beta$) we must return to (the bottom phase of) $\eta$.</Paragraph>
    <Paragraph position="23"> Therefore, it is necessary, while visiting the nodes in $\beta$, to store the adjunction node $\eta$. This can be done by pushing $\eta$ onto the stack at the point that we move to the root of $\beta$. Note that the stack may grow to unbounded length since we might adjoin at a node within $\beta$, and so on. When we reach the bottom phase of the foot node of $\beta$ the stack is popped and we find the node at which $\beta$ was adjoined at the top of the stack.</Paragraph>
    <Paragraph position="24"> From the above discussion it is clear that the lig needs just two nonterminals, $\top$ and $\bot$. At each step of a derivation in the lig shared forest grammar the top of the stack will specify the node currently being visited. Also, if the node $\eta$ being visited belongs to an auxiliary tree and is on its spine we can expect the symbol below the top of the stack to give us the node where that auxiliary tree is adjoined. If $\eta$ is not on the spine of an auxiliary tree then it is the only symbol on the stack.</Paragraph>
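    <Paragraph> The traversal just described can be sketched in Python for one fixed derivation. Everything named here is an assumption made for illustration: the three elementary trees, the encoding of a derivation as a mapping from node addresses to the adjoined tree, and the functions top and bottom. Unlike the lig, the sketch keeps the node being visited in the function arguments, so the stack holds only the pending adjunction nodes, and each stack entry also carries the adjunction choices of its tree so that the fixed derivation can be resumed.

```python
# Hypothetical elementary trees, encoded as (label, [children]).
TREES = {
    "alpha": ("S", [("c", [])]),
    "beta_a": ("S", [("a", []), ("S", [("S", []), ("a", [])])]),
    "beta_b": ("S", [("b", []), ("S", [("S", []), ("b", [])])]),
}
FOOT = {"beta_a": (2, 1), "beta_b": (2, 1)}   # foot addresses


def node_at(name, addr):
    tree = TREES[name]
    for k in addr:
        tree = tree[1][k - 1]
    return tree


def on_spine(name, addr):
    """True if `addr` lies on the path from the root to the foot of `name`."""
    foot = FOOT.get(name)
    return foot is not None and foot[:len(addr)] == addr


def top(name, addr, adjs, stack, out):
    """Top phase: consider adjunction at the node (name, addr)."""
    chosen = adjs.get(addr)
    if chosen is None:
        bottom(name, addr, adjs, stack, out)      # no adjunction here
    else:
        beta, beta_adjs = chosen
        # Push the adjunction node and visit the root of the auxiliary tree.
        top(beta, (), beta_adjs, stack + ((name, addr, adjs),), out)


def bottom(name, addr, adjs, stack, out):
    """Bottom phase: visit what lies below the node (name, addr)."""
    label, children = node_at(name, addr)
    if FOOT.get(name) == addr:
        # Foot node: pop the stack and resume below the adjunction node.
        n2, a2, adjs2 = stack[-1]
        bottom(n2, a2, adjs2, stack[:-1], out)
    elif not children:
        out.append(label)                          # terminal leaf
    else:
        for k in range(1, len(children) + 1):
            child = addr + (k,)
            # Only the child on the spine inherits the stack.
            inherited = stack if on_spine(name, child) else ()
            top(name, child, adjs, inherited, out)


# Assumed derivation: beta_a adjoined at the root of alpha,
# beta_b adjoined at the node with address (2,) of beta_a.
derivation = {(): ("beta_a", {(2,): ("beta_b", {})})}
out = []
top("alpha", (), derivation, (), out)
```

    The stack grows by one entry at each adjunction and shrinks by one at each foot node, which is the push and pop behaviour that the lig productions of this section encode.</Paragraph>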
    <Paragraph position="25"> We now show how the lig shared forest grammar can be constructed for a given string $w = a_1 \ldots a_n$. Suppose we have a tag</Paragraph>
    <Paragraph position="27"> that generates the intersection of L(G) and L(Mw).</Paragraph>
    <Paragraph position="28"> $P$ includes the following set of productions for the start symbol $S'$: $\{ S'[\,] \rightarrow (\top, q_0, q_f)[\eta] \mid q_f \in F$ and $\eta$ is the root of an initial tree $\}$. In addition, for each elementary node $\eta$ do the following.
    Case 1: When $\eta$ is a node that is labeled by a terminal $a$, $P$ includes the production $(\top, p, q)[\eta] \rightarrow a$ for each $p, q \in Q$ such that $q \in \delta(p, a)$.
    Case 2a: When $\eta_1$ and $\eta_2$ are the children of a node $\eta$ such that the left sibling $\eta_1$ is on the spine or neither child is on the spine, $P$ includes the production $(\bot, p, q)[\circ\circ\,\eta] \rightarrow (\top, p, r)[\circ\circ\,\eta_1]\,(\top, r, q)[\eta_2]$ for each $p, q, r \in Q$. Note that the stack of adjunction points must be passed to the ancestor of the foot node all the way to the root.</Paragraph>
    <Paragraph position="29"> Case 2b: When $\eta_1$ and $\eta_2$ are the children of a node $\eta$ such that the right sibling $\eta_2$ is on the spine, $P$ includes the production $(\bot, p, q)[\circ\circ\,\eta] \rightarrow (\top, p, r)[\eta_1]\,(\top, r, q)[\circ\circ\,\eta_2]$ for each $p, q, r \in Q$.</Paragraph>
    <Paragraph position="30"> Case 3: When $\eta$ is a nonterminal node that does not have an OA constraint, $P$ includes the production $(\top, p, q)[\circ\circ\,\eta] \rightarrow (\bot, p, q)[\circ\circ\,\eta]$ for each $p, q \in Q$. This production is used when no adjunction takes place and we move directly between the top and bottom phases of $\eta$.</Paragraph>
    <Paragraph position="31"> Case 4a: When $\eta$ is a node where $\beta$ can be adjoined and $\eta'$ is the root node of $\beta$, $P$ includes the production $(\top, p, q)[\circ\circ\,\eta] \rightarrow (\top, p, q)[\circ\circ\,\eta\eta']$ for each $p, q \in Q$. Note that the adjunction node $\eta$ has been pushed below the new node $\eta'$ on the stack.
    Case 4b: When $\eta$ is a node where $\beta$ can be adjoined and $\eta'$ is the foot node of $\beta$, $P$ includes the production $(\bot, p, q)[\circ\circ\,\eta\eta'] \rightarrow (\bot, p, q)[\circ\circ\,\eta]$ for each $p, q \in Q$. Note that the stack symbol that appeared below $\eta'$ will be the node at which $\beta$ was adjoined.</Paragraph>
    <Paragraph position="32"> Since the state set of $M_w$ is $\{0, \ldots, n\}$ there are $O(n^2)$ nonterminals in the grammar. Since at most three states are used in the productions, $G_w$ has $O(n^3)$ productions. The time taken to construct this grammar is also $O(n^3)$. As in the cfg shared forest grammar constructed in Section 6 we have assumed that the tag is binary branching for the sake of simplifying the presentation. The construction can be adapted to allow for any degree of branching through the use of additional (binary) lig productions. Furthermore, this would not increase the space complexity of the grammar. Finally, note that unlike the cfg shared forest grammar, in the lig shared forest grammar $G_w$, $w$ is derived in $G_0$ if and only if $w$ is derived in $G_w$. Of course in both cases $L(G_w) = \{w\} \cap L(G_0)$ and hence the recognition problem can be solved by determining whether the shared forest grammar generates the empty set or not.</Paragraph>
  </Section>
  <Section position="8" start_page="389" end_page="389" type="metho">
    <SectionTitle>
8 Removing Useless Symbols
</SectionTitle>
    <Paragraph position="0"> As in the case of the cfg shared forest grammar, to solve the original recognition problem we have to determine if $L(G_w) \ne \emptyset$. In particular, we have to determine whether $S'[\,]$ derives a terminal string. We solve this question by constructing an nfa, $M_{G_w}$, from $G_w$ where the states of $M_{G_w}$ correspond to the nonterminal and terminal symbols of $G_w$. This transforms the question of determining whether a symbol is useful into a reachability question on the graph of $M_{G_w}$. In particular, for any string of stack symbols $\gamma$, the object $A[\gamma]$ derives a string of terminals if and only if it is possible, in the nfa $M_{G_w}$, to reach a final state from the state corresponding to $A$ on the input $\gamma$. Thus, $w \in L(G_0)$ if and only if $S'[\,] \Rightarrow^*_{G_w} w$ if and only if in $M_{G_w}$ a final state is reachable from the state corresponding to $S'$ on the empty string.</Paragraph>
    <Paragraph position="1"> Given a lig $G_w = (V_N, V_T, V_I, S', P)$ we construct the nfa $M_{G_w} = (Q, \Sigma, \delta, q_0, F)$ as follows. Let the state set of $M_{G_w}$ be the nonterminal and terminal alphabet of $G_w$, i.e., $Q = V_N \cup V_T$. The initial state of $M_{G_w}$ is the start symbol of $G_w$, i.e., $q_0 = S'$. The input alphabet of $M_{G_w}$ is the stack alphabet of $G_w$, i.e., $\Sigma = V_I$. Note that since $G_w$ is the lig shared forest the set $V_I$ is the set of the elementary node addresses of the object tag grammar $G_0$. The set of final states, $F$, of $M_{G_w}$ is the set $V_T$. The transition function $\delta$ of $M_{G_w}$ is defined as follows.</Paragraph>
    <Paragraph position="2">  Given that $w = a_1 \ldots a_n$ and that the nonterminals (and corresponding states in $M_{G_w}$) of $G_w$ are of the form $(\top, i, j)$ or $(\bot, i, j)$ where $0 \le i \le j \le n$, there are $O(n^2)$ nonterminals (states in $M_{G_w}$) in the lig $G_w$. The size of $M_{G_w}$ is $O(n^4)$ since there are $O(n^2)$ out-transitions from each state.</Paragraph>
    <Paragraph position="3"> We can use standard dynamic programming techniques to ensure that each production is considered only once. Given such an algorithm it is easy to check that the construction of $M_{G_w}$ will take $O(n^6)$ time.</Paragraph>
    <Paragraph position="4"> The worst case corresponds to case 4a which will take $O(n^4)$ for each production. However, there are only $O(n^2)$ such productions (for which case 4a applies).</Paragraph>
    <Paragraph position="5"> Once the nfa has been constructed the recognition problem (i.e., whether $w \in L(G_0)$) takes $O(n^2)$ time.</Paragraph>
    <Paragraph position="6"> We have to check if there is an $\epsilon$-transition from the initial state to a final state and hence we will have to consider $O(n^2)$ transitions.</Paragraph>
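    <Paragraph> The reachability question can be sketched in Python as a generic search over an nfa with $\epsilon$-transitions. The transition table below is purely hypothetical, since the definition of $\delta$ for $M_{G_w}$ is not reproduced in this text, and the order in which the stack symbols are read is likewise an assumption. The empty string "" stands for $\epsilon$ in the table.

```python
def closure(delta, states):
    """All states reachable from `states` using only epsilon-transitions."""
    seen, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for r in delta.get((q, ""), ()):
            if r not in seen:
                seen.add(r)
                todo.append(r)
    return seen


def reaches_final(delta, start, finals, gamma):
    """True if a final state is reachable from `start` on input `gamma`."""
    current = closure(delta, {start})
    for symbol in gamma:
        moved = set()
        for q in current:
            moved.update(delta.get((q, symbol), ()))
        current = closure(delta, moved)
    return not current.isdisjoint(finals)


# Hypothetical transition table: states are grammar symbols,
# inputs are stack symbols.
delta = {
    ("S'", ""): {"X"},
    ("X", ""): {"a"},
    ("X", "eta"): {"Y"},
}
finals = {"a"}
start_is_useful = reaches_final(delta, "S'", finals, ())
```

    With such a function, recognition amounts to asking whether a final state is reachable from the state for $S'$ on the empty input.</Paragraph>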
    <Paragraph position="7"> A straightforward algorithm can be used to remove the states for nonterminals that do not appear in any sentential form derived from $S'$. In other words, only keep states $A$ such that for some $\gamma$ there is a derivation $S'[\,] \Rightarrow^*_{G_w} \Upsilon_1 A[\gamma] \Upsilon_2$ for some $\Upsilon_1, \Upsilon_2 \in (V_C(G_w) \cup V_T)^*$.</Paragraph>
    <Paragraph position="8"> Note that the states to be removed are not those states that are not reachable from the initial state of $M_{G_w}$. The set of states reachable from the initial state includes only the set of nonterminals in objects that are the distinguished descendants of the root node in some derivation.</Paragraph>
    <Paragraph position="9"> From the construction of $M_{G_w}$ it is the case that for each $A \in V_N$ the set $\{ \gamma \mid a \in \delta(A, \gamma)$ for some $a \in F \}$ is equal to the set $\{ \gamma \mid A[\gamma] \Rightarrow^*_{G_w} z$ for some $z \in V_T^* \}$. Thus, if a final state is accessible from a state $A$ then for some $\gamma$ (that witnesses the accessibility of a final state from $A$) $A[\gamma] \Rightarrow^*_{G_w} z$ for some $z \in V_T^*$.</Paragraph>
    <Paragraph position="10"> Once the construction of $M_{G_w}$ is complete we only retain those productions in $G_w$ that involve nonterminals that remain in the state set of $M_{G_w}$. However, unlike the case of the cfg shared forest grammar, the extraction of individual parses for the input $w$ does not simply involve reading off a derivation of $G_w$. This is due to the fact that although retaining the state $A$ does mean that there is a derivation $S'[\,] \Rightarrow^*_{G_w} \Upsilon_1 A[\gamma] \Upsilon_2$ for some $\gamma$ and $\Upsilon_1, \Upsilon_2$, we cannot guarantee that $A[\gamma]$ will derive a string of terminals. The next section describes how to deal with this problem.</Paragraph>
    <Paragraph position="11"> 9 Recovery of a Parse. Let the lig $G_w$ with useless productions removed be $G_w = (V_N, V_T, V_I, S', P)$ and let the nfa $M_{G_w}$ constructed in Section 8 with unnecessary states removed be $M_{G_w} = (V_N \cup V_T, V_I, \delta, S', V_T)$. Recovering a parse of the string $w$ by the object grammar $G_0$ has now been converted into the problem of extracting one of the derivations of $G_w$. However, this is not entirely straightforward.</Paragraph>
    <Paragraph position="12"> The presence of a state $A$ in $V_N \cup V_T$ indicates that for some $\gamma$ in $V_I^*$ and $\Upsilon_1, \Upsilon_2$ in $(V_C(G_w) \cup V_T)^*$ we have $S'[\,] \Rightarrow^*_{G_w} \Upsilon_1 A[\gamma] \Upsilon_2$. However, it is not necessarily the case that $\delta(A, \gamma) \cap V_T \ne \emptyset$, i.e., it might not be possible to reach a final state of $M_{G_w}$ from $A$ with input $\gamma$. All we know is that there is some $\gamma' \in V_I^*$ (that could be distinct from $\gamma$) such that $A[\gamma']$ derives a terminal string, i.e., at least one final state is accessible from $A$ on the string $\gamma'$.</Paragraph>
    <Paragraph position="13"> This means that in recovering a derivation of $G_w$ by considering the top-down application of productions we must be careful about which production we choose at each stage. We cannot assume that any choice of production for an object $A[\gamma]$ will eventually lead to a complete derivation. Even if the top of the stack $\gamma$ is compatible with the use of a production, this does not guarantee that $A[\gamma]$ derives a terminal string.</Paragraph>
    <Paragraph position="14"> We give a procedure recover that can be used to recover a derivation of $G_w$ by using the nfa $M_{G_w}$.</Paragraph>
    <Paragraph position="15"> This procedure guarantees that when we reach a state $A$ by traversing a path $\gamma$ from the initial state then on the same string $\gamma$ a final state can be reached from the state $A$.</Paragraph>
    <Paragraph position="16"> If recover($T_1 \ldots T_n a$) is invoked the following hold.  To recover a parse we call recover($((\top, 1, n), \eta)\,a$) where $a \in V_T$ such that $\delta((\top, 1, n), \eta) = a$ and $\eta \in V_I$ is the root of some initial tree. The definition of recover is as follows.</Paragraph>
    <Paragraph position="18"> then output $p$. Note there must be such a production. Case 2a: If there is some production</Paragraph>
    <Paragraph position="20"> such that $\delta(C, \eta'') = b$ for some $b \in V_T$, and either $n &gt; 1$ and $A_2 \in \delta(B, \eta')$ (where $T_2 = (A_2, \eta_2)$) or $n = 1$ and $a \in \delta(B, \eta')$ then output $p$. recover($(B, \eta') T_2 \ldots T_n a$). recover($(C, \eta'') b$). Case 2b: If there is some production</Paragraph>
    <Paragraph position="22"> such that $\delta(C, \eta'') = b$ for some $b \in V_T$ and either $n &gt; 1$ and $A_2 \in \delta(B, \eta')$ (where $T_2 = (A_2, \eta_2)$) or $n = 1$ and $a \in \delta(B, \eta')$ then output $p$. recover($(B, \eta') T_2 \ldots T_n a$). recover($(C, \eta'') b$). Case 3: If there is some production</Paragraph>
    <Paragraph position="24"> such that either $n &gt; 1$ and $A_2 \in \delta(B, \eta')$ (where</Paragraph>
    <Paragraph position="26"> such that $C \in \delta(B, \eta')$ for some $C \in V_N$ and $A_2 \in \delta(C, \eta_1)$ and either $n &gt; 1$ and $T_2 = (A_2, \eta_2)$ or $n = 1$ and $a \in \delta(C, \eta_1)$ then output $p$. recover($(B, \eta')(C, \eta_1) T_2 \ldots T_n a$). Case 4b: If there is a production</Paragraph>
    <Paragraph position="28"> such that $n &gt; 1$ and $T_2 = (A_2, \eta_2)$ then output $p$. recover($T_2 \ldots T_n$). Given the form of the nonterminals and productions of $G_w$ we can see that the complexity of extracting a parse as above is dominated by the complexity of Case 4a, which takes $O(n^4)$ time. If in $G_0$ every elementary tree has at least one terminal symbol in its frontier (as in a lexicalized tag) then to derive a string of length $n$ there can be at most $n$ adjunctions. In that case, when we wish to recover a parse the derivation height (which gives the recursion depth of the invocation of the above procedure) is $O(n)$ and hence recovery of a parse will take $O(n^5)$ time.</Paragraph>
  </Section>
</Paper>