<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1514">
  <Title>Generating XTAG Parsers from Algebraic Specifications*</Title>
  <Section position="3" start_page="0" end_page="106" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Since Tree Adjoining Grammars (TAG) were introduced, several different parsing algorithms for these grammars have been developed, each with its own characteristics. Identifying the advantages and disadvantages of each of them is not trivial, and the literature contains no comparative studies of them that work with real-life, wide-coverage grammars. In this paper, we use a generic tool based on parsing schemata to generate implementations of several TAG parsers and compare them by parsing with the XTAG English Grammar (XTAG, 2001).</Paragraph>
    <Paragraph position="1"> * Partially supported by Ministerio de Educación y Ciencia and FEDER (TIN2004-07246-C03-01, TIN2004-07246-C03-02), Xunta de Galicia (PGIDIT05PXIC30501PN, PGIDIT05PXIC10501PN, PGIDIT05SIN044E and PGIDIT05SIN059E), and Programa de becas FPU (Ministerio de Educación y Ciencia). We are grateful to Eric Villemonte de la Clergerie and François Barthelemy for their help in converting the XTAG grammar to XML.</Paragraph>
    <Paragraph position="2"> The parsing schemata formalism (Sikkel, 1997) is a framework that allows us to describe parsers in a simple and declarative way. A parsing schema is a representation of a parsing algorithm as a set of inference rules which are used to perform deductions on intermediate results called items.</Paragraph>
    <Paragraph position="3"> These items represent sets of incomplete parse trees which the algorithm can generate. An input sentence to be analyzed produces an initial set of items. Additionally, a parsing schema must define a criterion to determine which items are final, i.e. which items correspond to complete parses of the input sentence. If it is possible to obtain a final item from the set of initial items by using the schema's inference rules (called deductive steps), then the input sentence belongs to the language defined by the grammar. The parse forest can then be retrieved from the intermediate items used to infer the final items, as in (Billot and Lang, 1989).</Paragraph>
    <Paragraph position="4"> As an example, we introduce a CYK-based algorithm (Vijay-Shanker and Joshi, 1985) for TAG. Given a tree adjoining grammar $G = (V_T, V_N, S, I, A)$ and a sentence of length $n$ which we denote by $a_1\, a_2 \ldots a_n$, we denote by $P(G)$ the set of productions $\{N^\gamma \rightarrow N^\gamma_1 N^\gamma_2 \ldots N^\gamma_r\}$ such that $N^\gamma$ is an inner node of a tree $\gamma \in (I \cup A)$, and $N^\gamma_1 N^\gamma_2 \ldots N^\gamma_r$ is the ordered sequence of direct children of $N^\gamma$.</Paragraph>
    <Paragraph position="5"> The parsing schema for the TAG CYK-based algorithm (Alonso et al., 1999) is a function that maps such a grammar $G$ to a deduction system whose domain is the set of items $\{[N^\gamma, i, j, p, q, adj]\}$ verifying that $N^\gamma$ is a tree node in an elementary tree $\gamma \in (I \cup A)$, $i$ and $j$ ($0 \le i \le j$) are string positions, $p$ and $q$ may be undefined or instantiated to positions $i \le p \le q \le j$ (the latter only when $\gamma \in A$), and $adj \in \{true, false\}$ indicates whether an adjunction has been performed on node $N^\gamma$. (Notation: nonterminal symbols are represented by uppercase letters (A, B, ...), and terminals by lowercase letters (a, b, ...). Greek letters ($\alpha$, $\beta$, ...) will be used to represent trees, $N^\gamma$ a node in the tree $\gamma$, and $R^\gamma$ the root node of the tree $\gamma$.)</Paragraph>
    <Paragraph position="6"> The positions $i$ and $j$ indicate that a substring $a_{i+1} \ldots a_j$ of the string is being recognized, and positions $p$ and $q$ denote the substring dominated by $\gamma$'s foot node. The final item set would be $\{[R^\alpha, 0, n, -, -, adj] \mid \alpha \in I\}$, where $-$ marks an undefined position, since the presence of such an item would indicate that there exists a valid parse tree with yield $a_1\, a_2 \ldots a_n$ and rooted at $R^\alpha$, the root of an initial tree; and therefore there exists a complete parse tree for the sentence.</Paragraph>
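The item format and the final-item criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name Item, its field names, the helper is_final, and the use of None for an undefined position are all hypothetical choices, not part of the system described in the paper.

```python
from typing import NamedTuple, Optional


class Item(NamedTuple):
    """One CYK item [N, i, j, p, q, adj]; all names here are illustrative."""
    tree: str             # hypothetical identifier of the elementary tree
    node: str             # node inside that tree
    i: int                # start of the recognized substring
    j: int                # end of the recognized substring
    p: Optional[int]      # foot-node span start, None when undefined
    q: Optional[int]      # foot-node span end, None when undefined
    adj: bool             # whether an adjunction was performed on the node


def is_final(item: Item, n: int, initial_roots: set) -> bool:
    """True when the item covers positions 0..n from the root of an initial tree."""
    return (
        (item.tree, item.node) in initial_roots
        and item.i == 0
        and item.j == n
        and item.p is None
        and item.q is None
    )
```

The adj field is deliberately ignored by is_final, mirroring the fact that the final item set leaves adj unconstrained.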
    <Paragraph position="7"> A deductive step $\frac{\eta_1 \ldots \eta_m}{\xi}\ \Phi$ allows us to infer the item specified by its consequent $\xi$ from those in its antecedents $\eta_1 \ldots \eta_m$. Side conditions ($\Phi$) specify the valid values for the variables appearing in the antecedents and consequent, and may refer to grammar rules or specify other constraints that must be verified in order to infer the consequent.</Paragraph>
    <Paragraph position="8"> The deductive steps for our CYK-based parser are shown in figure 1. The steps $D^{Scan}_{CYK}$ and $D^{\epsilon}_{CYK}$ are used to start the bottom-up parsing process by recognizing a terminal symbol from the input string, or none if we are using a tree with an epsilon node.</Paragraph>
    <Paragraph position="9"> The $D^{Binary}_{CYK}$ step (where the operation $p \cup p'$ returns $p$ if $p$ is defined, and $p'$ otherwise) represents the bottom-up parsing operation which joins two subtrees into one, and is analogous to one of the deductive steps of the CYK parser for CFG. The $D^{Unary}_{CYK}$ step is used to handle unary branching productions. $D^{Foot}_{CYK}$ and $D^{Adj}_{CYK}$ implement the adjunction operation, where a tree $\beta$ is adjoined into a node $N^\gamma$; their side condition $\beta \in adj(N^\gamma)$ means that $\beta$ must be adjoinable into the node $N^\gamma$ (which involves checking that $N^\gamma$ is an adjunction node, comparing its label to that of $R^\beta$ and verifying that no adjunction constraint disallows the operation). Finally, the $D^{Subs}_{CYK}$ step implements the substitution operation in grammars supporting it.</Paragraph>
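To make the binary step more concrete, the following Python sketch shows the "p if defined, otherwise p'" operation and one way two adjacent items could be joined. It is a simplified illustration, not the step from figure 1: the names first_defined and binary_step, the productions dictionary, the tuple layout of items, and the choice to reset adj to false in the consequent are all assumptions made for this example.

```python
from typing import Optional, Tuple

# Hypothetical item layout: (node, i, j, p, q, adj); None means "undefined".
Item = Tuple[str, int, int, Optional[int], Optional[int], bool]


def first_defined(p: Optional[int], p_prime: Optional[int]) -> Optional[int]:
    """Return p if it is defined, and p_prime otherwise."""
    return p if p is not None else p_prime


def binary_step(left: Item, right: Item, productions: dict) -> Optional[Item]:
    """Join two adjacent sibling items into an item for their parent node.

    productions maps a pair of child nodes to their parent node
    (an assumed encoding of binary productions in P(G)).
    """
    left_node, i, k, p, q, _ = left
    right_node, k2, j, p2, q2, _ = right
    parent = productions.get((left_node, right_node))
    if parent is None or k != k2:
        return None  # side condition fails: no such production, or a gap
    # Assumption: no adjunction has yet been performed on the parent node.
    return (parent, i, j, first_defined(p, p2), first_defined(q, q2), False)
```

Testing with "is not None" rather than truthiness matters here, because string position 0 is a defined value.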
    <Paragraph position="10"> As can be seen from the example, parsing schemata are simple, high-level descriptions that convey the fundamental semantics of parsing algorithms while abstracting implementation details: they define a set of possible intermediate results and allowed operations on them, but they don't specify data structures for storing the results or an order for the operations to be executed. This high abstraction level makes schemata useful for defining, comparing and analyzing parsers with pencil and paper without worrying about implementation details. However, if we want to actually execute the parsers and analyze their results and performance on a computer, they must be implemented in a programming language, which means giving up the high level of abstraction in order to obtain functional and efficient implementations.</Paragraph>
    <Paragraph position="11"> In order to bridge this gap between theory and practice, we have designed and implemented a system able to automatically transform parsing schemata into efficient Java implementations of their corresponding algorithms. The input to this system is a simple and declarative representation of a parsing schema, which is practically identical to the formal notation that we used previously. For example, this is the $D^{Binary}_{CYK}$ deductive step shown in figure 1 in a format readable by our compiler:</Paragraph>
    <Paragraph position="13"> The parsing schemata compilation technique used by our system is based on the following fundamental ideas (Gómez-Rodríguez et al., 2006a): * Each deductive step is compiled to a Java class containing code to match and search for antecedent items and generate the corresponding conclusions from the consequent.</Paragraph>
    <Paragraph position="14"> * The step classes are coordinated by a deductive parsing engine, such as the one described in (Shieber et al., 1995). This algorithm ensures a sound and complete deduction process, guaranteeing that all items that can be generated from the initial items will be obtained.</Paragraph>
    <Paragraph position="15"> * To attain efficiency, an automatic analysis of the schema is performed in order to create indexes allowing fast access to items. As each different parsing schema needs to perform different searches for antecedent items, the index structures we generate are schema-specific. In this way, we guarantee constant-time access to items so that the computational complexity of our generated implementations is never above the theoretical complexity of the parsers.</Paragraph>
    <Paragraph position="16"> * Since parsing schemata have an open notation, in which any mathematical object can potentially appear inside items, the system includes an extensibility mechanism which can be used to define new kinds of objects to use in schemata.</Paragraph>
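The ideas listed above (step code, a coordinating deduction engine, and indexes for fast item lookup) can be illustrated with a small agenda-driven engine in Python. This is a generic sketch in the spirit of deductive parsing engines, not the code generated by the system: the names deduce, span_keys and concat_step are hypothetical, and the toy schema only concatenates adjacent spans so that the example stays self-contained.

```python
from collections import defaultdict, deque


def deduce(initial_items, steps, index_keys):
    """Close a set of items under the given deductive steps."""
    chart = set()                    # every item derived so far
    index = defaultdict(set)         # lookup key -> items filed under it
    agenda = deque(initial_items)    # items whose consequences are pending
    while agenda:
        item = agenda.popleft()
        if item in chart:
            continue                 # already processed
        chart.add(item)
        for key in index_keys(item):
            index[key].add(item)
        for step in steps:
            # list(...) runs the step to completion before the agenda grows
            agenda.extend(list(step(item, index)))
    return chart


def span_keys(item):
    """File a span (i, j) under its start and end positions."""
    i, j = item
    return [("start", i), ("end", j)]


def concat_step(item, index):
    """Toy step: from spans (i, k) and (k, j) infer the span (i, j)."""
    i, j = item
    for _, k in index[("start", j)]:   # item acts as the left antecedent
        yield (i, k)
    for h, _ in index[("end", i)]:     # item acts as the right antecedent
        yield (h, j)


chart = deduce([(0, 1), (1, 2), (2, 3)], [concat_step], span_keys)
```

Each step finds its partner items with a dictionary lookup instead of scanning the whole chart, which is the role that the schema-specific indexes play in the text. In this toy run, the span (0, 3) ends up in the chart, playing the part of a final item for a three-word input.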
    <Paragraph position="18"> 2 Generating parsers for the XTAG grammar
By using parsing schemata such as the ones in (Alonso et al., 1999; Nederhof, 1999) as input to our system, we can easily obtain efficient implementations of several TAG parsing algorithms. In this section, we describe how we have dealt with the particular characteristics of the XTAG grammar in order to make it compatible with our generic compilation technique; we also provide empirical results which allow us to compare the performance of several different TAG parsing algorithms in the practical case of the XTAG grammar. It should be noted that similar comparisons have been made with smaller grammars, such as simplified subsets of the XTAG grammar, but not with the whole XTAG grammar with all its trees and feature structures. Therefore, our comparison provides valuable information about the behavior of various parsers on a complete, large-scale natural language grammar. This behavior is very different from the one that can be observed on small grammars, since grammar size becomes a dominant factor in computational complexity when large grammars like the XTAG are used to parse relatively small natural language sentences (Gómez-Rodríguez et al., 2006b).</Paragraph>
    <Section position="1" start_page="104" end_page="104" type="sub_section">
      <SectionTitle>
2.1 Grammar conversion
</SectionTitle>
      <Paragraph position="0"> The first step we undertook in order to generate parsers for the XTAG grammar was a full conversion of the grammar to an XML-based format, a variant of the TAG markup language (TAGML).</Paragraph>
      <Paragraph position="1"> In this way we had the grammar in a well-defined format, easy to parse and modify. During this conversion, the trees' anchor nodes were duplicated so that our generic TAG parsers allow adjunctions on anchor nodes, which the XTAG grammar permits.</Paragraph>
    </Section>
    <Section position="2" start_page="104" end_page="105" type="sub_section">
      <SectionTitle>
2.2 Feature structure unification
</SectionTitle>
      <Paragraph position="0"> Two strategies may be used in order to take unification into account in parsing: feature structures can be unified after parsing or during parsing. We have compared the two approaches for the XTAG grammar (see table 1), and the general conclusion is that unification during parsing performs better for most of the sentences, although its runtimes have a larger variance and it performs much worse for some particular cases.</Paragraph>
      <Paragraph position="1"> To implement unification during parsing in our parsing-schemata-based system, we must extend our schemata so that their steps perform unification.</Paragraph>
      <Paragraph position="2"> This can be done in the following way: * Items are extended so that they will hold a feature structure in addition to the rest of the information they include.</Paragraph>
      <Paragraph position="3"> * We need to define two operations on feature structures: the unification operation and the "keep variables" operation. The "keep variables" operation is a transformation on feature structures that takes a feature structure as an argument, which may contain features, values, symbolic variables and associations between them, and returns a feature structure containing only the variable-value associations related to a given elementary tree, ignoring the variables and values not associated through these relations, and completely ignoring features.</Paragraph>
      <Paragraph position="4"> * During the process of parsing, feature structures that refer to the same node, or to nodes that are taking part in a substitution or adjunction and are going to collapse to a single node in the final parse tree, must be unified. To do this, the test that these nodes must unify is added as a side condition to the steps that must handle them, and the unification results are included in the item generated by the consequent. Of course, considerations about the different role of the top and bottom feature structures in adjunction and substitution must be taken into account when determining which feature structures must be unified. [Table 1 caption: ... during and after parsing. The following data are shown: mean, trimmed means (10 and 20%), quartiles, standard deviation, and p-value for the Wilcoxon paired signed rank test (the p-value of 0.4545 indicates that no statistically significant difference was found between the medians).]</Paragraph>
      <Paragraph position="5"> * Feature structures in items must only hold variable-value associations for the symbolic variables appearing in the tree to which the structures refer, since these associations hold the information that we need in order to propagate values according to the rules specified in the unification equations. Variable-value associations referring to different elementary trees are irrelevant when parsing a given tree, and feature-value and feature-variable associations are local to a node and can't be extrapolated to other nodes, so we won't propagate any of this information in items. However, it must be used locally for unification. Therefore, steps perform unification by using the information in their antecedent items and recovering complete feature structures associated to nodes directly from the grammar, and then use the "keep variables" operation to remove the information that we don't need in the consequent item.</Paragraph>
      <Paragraph position="6"> * In some algorithms, such as CYK, a single deductive step deals with several different elementary tree nodes that don't collapse into one in the final parse tree. In this case, several "keep variables" operations must be performed on each step execution, one for each of these nodes. If we just unified the information on all the nodes and called "keep variables" at the end, we could propagate information incorrectly.</Paragraph>
      <Paragraph position="7"> * In Earley-type algorithms, we must decide how predictor steps handle feature structures. Two options are possible: one is propagating the feature structure in the antecedent item to the consequent, and the other is discarding the feature structure and generating a consequent whose associated feature structure is empty. The first option has the advantage that violations of unification constraints are detected earlier, thus avoiding the generation of some items. However, in scenarios where a predictor is applied to several items differing only in their associated feature structures, this approach generates several different items while the discarding approach collapses them into a single consequent item. Moreover, the propagating approach favors the appearance of items with more complex feature structures, thus making unification operations slower. In practice, for XTAG we have found that these drawbacks of propagating the structures outweigh the advantages, especially in complex sentences, where the discarding approach performs much better.</Paragraph>
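The two operations named in the list above can be sketched in Python under a strong simplifying assumption: a feature structure is reduced to a flat dictionary of variable-value bindings with atomic values. Real XTAG feature structures also contain features, variable sharing and top/bottom distinctions, none of which is modeled here, and the names unify and keep_variables are illustrative only.

```python
from typing import Dict, Optional, Set

# Assumed simplification: only variable -> atomic value bindings are kept.
Bindings = Dict[str, str]


def unify(a: Bindings, b: Bindings) -> Optional[Bindings]:
    """Merge two binding sets, or return None if they disagree on a variable."""
    merged = dict(a)
    for var, value in b.items():
        if var in merged and merged[var] != value:
            return None  # clash: the side condition of the step fails
        merged[var] = value
    return merged


def keep_variables(bindings: Bindings, tree_vars: Set[str]) -> Bindings:
    """Keep only the bindings for variables that belong to a given tree."""
    return {var: val for var, val in bindings.items() if var in tree_vars}
```

A step would call unify on the information from its antecedents and the grammar, and then apply keep_variables once per elementary tree involved before building the consequent, so that bindings belonging to other trees are not propagated.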
    </Section>
    <Section position="3" start_page="105" end_page="106" type="sub_section">
      <SectionTitle>
2.3 Tree filtering
</SectionTitle>
      <Paragraph position="0"> The full XTAG English grammar contains thousands of elementary trees, so performance is not good if we use the whole grammar to parse each sentence. Tree selection filters (Schabes and Joshi, 1991) are used to select a subset of the grammar, discarding the trees which are known not to be useful given the words in the input sentence.</Paragraph>
      <Paragraph position="1"> To emulate this functionality in our parsing schema-based system, we have used its extensibility mechanism to define a function Selects-tree(a, T) that returns true if the terminal symbol a selects the tree T. The implementation of this function is a Java method that looks for this information in XTAG's syntactic database. The function is then used in a filtering step added to our schemata: $$\frac{[a, i, j]}{[Selected, \alpha]}\quad \alpha \in Trees \,/\, \text{SELECTS-TREE}(a; \alpha)$$ The presence of an item of the form $[Selected, \alpha]$ indicates that the tree $\alpha$ has been selected by the filter and can be used for parsing. In order for the filter to take effect, we add $[Selected, \alpha]$ as an antecedent to every step in our schemata that introduces a new tree $\alpha$ into the parse (such as initter, substitution and adjoining steps). In this way we guarantee that no tree that fails the filter will be used for parsing.</Paragraph>
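The filtering step described above can be sketched in Python as follows. The lexicon dictionary stands in for XTAG's syntactic database, and the function names, item layout and tree names are hypothetical placeholders chosen for this example rather than anything taken from the actual system.

```python
def selects_tree(word, tree, lexicon):
    """True if the terminal symbol selects the tree (toy dictionary lookup)."""
    return tree in lexicon.get(word, ())


def filter_step(word_items, trees, lexicon):
    """From items [a, i, j], infer [Selected, tree] for every selected tree."""
    selected = set()
    for word, _i, _j in word_items:
        for tree in trees:
            if selects_tree(word, tree, lexicon):
                selected.add(("Selected", tree))
    return selected


# Hypothetical data: two input words and three candidate trees.
lexicon = {"John": {"tree_np"}, "sleeps": {"tree_intransitive"}}
trees = ["tree_np", "tree_intransitive", "tree_adverb"]
items = [("John", 0, 1), ("sleeps", 1, 2)]
selected = filter_step(items, trees, lexicon)
```

Steps that introduce a new tree would then require the matching ("Selected", tree) item among their antecedents, so "tree_adverb" in this toy example could never enter a parse.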
    </Section>
  </Section>
</Paper>