<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1075"> <Title>A MODULAR ARCHITECTURE FOR CONSTRAINT-BASED PARSING</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> A MODULAR ARCHITECTURE FOR CONSTRAINT-BASED PARSING François Barthélemy, François Rouaix, INRIA Rocquencourt, BP 105, 78153 Le Chesnay cedex, France and Universidade Nova de Lisboa, 2825 Monte de Caparica, Portugal ABSTRACT </SectionTitle> <Paragraph position="0"> This paper presents a framework and a system for implementing, comparing and analyzing parsers for some classes of Constraint-Based Grammars.</Paragraph> <Paragraph position="1"> The framework consists of a uniform theoretical description of parsing algorithms, and provides the structure for decomposing the system into logical components, each with possibly several interchangeable implementations. Many parsing algorithms can be obtained by composition of the modules of our system. Modularity is also a way of achieving code sharing for the common parts of these various algorithms. Furthermore, the design helps in reusing the existing modules when implementing other algorithms. The system uses the flexible modularity provided by the programming language Alcool-90, based on a type system that ensures the safety of module composition.</Paragraph> </Section> <Section position="2" start_page="0" end_page="454" type="metho"> <SectionTitle> 1 INTRODUCTION </SectionTitle> <Paragraph position="0"> We designed a system to study parsing. Our aim was not to implement only one parsing algorithm, but as many as possible, in such a way that we could compare their performance. We wanted to study the parsers' behavior rather than use them to exploit their parses. 
Furthermore, we wanted a system open to new developments, impossible to predict at the time we began our project.</Paragraph> <Paragraph position="1"> We achieved these aims by defining a modular architecture that, in addition, gives us code sharing between alternative implementations.</Paragraph> <Paragraph position="2"> Our system, called APOC-II, implements more than 60 different parsing algorithms for Context-Free Grammars, Tree-Adjoining Grammars, and Definite-Clause Grammars. The different generated parsers are comparable, because they are implemented in the same way, with common data structures. Experimental comparison can involve more than 20 parsers for a given grammar and give results independent of the implementation. Furthermore, adding new modules multiplies the number of parsing algorithms. APOC-II is open to new parsing techniques to such an extent that it can be seen as a library of tools for parsing, including constraint solvers, look-ahead, parsing strategies and control strategies. These tools make prototyping of parsing algorithms easier and quicker.</Paragraph> <Paragraph position="3"> The system is based on a general framework that divides parsing into three different tasks. First, the compilation, which translates a grammar into a push-down automaton describing how a parse tree is built. The automaton can be non-deterministic if several trees have to be considered when parsing a string. Second, the interpretation of the push-down automaton, which has to deal with non-determinism. Third, the constraint solving, used by both compilation and interpretation to perform operations related to constraints.</Paragraph> <Paragraph position="4"> Several algorithms can perform each of these three tasks: the compiler can generate either top-down or bottom-up automata, the interpreter can make use of backtracking or of tabulation, and the solver has to deal with different kinds of constraints (first-order terms, features, ... 
).</Paragraph> <Paragraph position="5"> Our architecture allows different combinations of three components (one for each basic task), each combination resulting in a specific parsing system. We use the Alcool-90 programming language to implement our modules. This language's type system allows the definition of alternative implementations of a component and ensures the safety of module combination, i.e. each module provides what is needed by other modules and receives what it requires. The same kind of modularity is used to split the main components (compiler, interpreter, solver) into independent sub-modules. Some of these sub-modules can be shared by several different implementations. For instance, the computation of look-ahead is the same for LL(k) and LR(k) techniques.</Paragraph> <Paragraph position="6"> The next section defines the class of grammars we consider. Then, a general framework for parsing and the sort of modularity it requires are presented. Section 4 is devoted to the Alcool-90 language, which provides a convenient module system. Section 5 is the detailed description of the APOC-II system, which implements the general framework using Alcool-90.</Paragraph> </Section> <Section position="3" start_page="454" end_page="454" type="metho"> <SectionTitle> 2 CONSTRAINT-BASED GRAMMARS </SectionTitle> <Paragraph position="0"> The notion of Constraint-Based Grammar appeared in computational linguistics. It is a useful abstraction of several classes of grammars, including those most commonly used to describe Natural Language with a view to computer processing.</Paragraph> <Paragraph position="1"> We give our own definition of constraint-based grammars, which may differ slightly from other definitions. Definition 1 (Constraint-Based Grammar). A constraint-based grammar is a 7-tuple $\langle N_t, T, a, V, A_x, C_L, R \rangle$ where [...] $C_L$ is a constraint language having $V$ as variable set and being closed under renaming and conjunction * $R$ is a finite set of rules of the form: $s_0(X_0) \to c, s_1(X_1), \ldots, s_n(X_n)$ such that $s_0 \in N_t$, $s_i \in N_t \cup T$ for $0 &lt; i \le n$, $c \in C_L$, the $X_i$ are tuples of $a(s_i)$ distinct variables, and the same variable cannot appear in two different tuples.</Paragraph> <Paragraph position="2"> In this definition, we use the notion of constraint language to define the syntax and the semantics of the constraints used by the grammars. We refer to the definition given by Höhfeld and Smolka in [HS88]. This definition is especially suitable for constraints used in NLP (unrestricted syntax, multiplicity of interpretation domains). The closure under renaming property has also been defined by Höhfeld and Smolka. It ensures that constraints are independent of the variable names. This grounds the systematic renaming of grammar rules to avoid variable conflicts. Definition 2 (Constraint Language). A constraint language is a 4-tuple $(V, C, \nu, I)$ such that: [...] For lack of space we do not recall in detail what an interpretation and the &quot;closure under renaming&quot; property are, and refer to [HS88]. The semantics of Constraint-Based Grammars is defined by the semantics of the constraint language and the notion of syntax tree. A syntax tree is a tree which has a grammar rule (renamed with fresh variables) as label of each node. A constraint is associated with a parse tree: it is the conjunction of all the constraints of the labels and of the equalities between the tuple of variables of the non-terminal on the left-hand side of a label and the tuple of the relevant symbol on the right-hand side of the label of its parent.</Paragraph> <Paragraph position="3"> An important point about parse trees is that the order of terminal symbols of the input string and the order of the symbols in right-hand sides of rules are significant.</Paragraph> <Paragraph position="4"> A Context-Free Grammar is obtained just by removing tuples and constraints from the grammar rules. 
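As an illustration only, the following Python sketch shows one possible encoding of a rule and of its context-free projection. The class and function names (Atom, Rule, well_formed, context_free_projection), the representation of the constraint as a plain string and the sample rule are assumptions made for this sketch; they are not taken from APOC-II.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Atom:
    """A grammar symbol together with its tuple of variables."""
    symbol: str
    variables: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Rule:
    """A rule of the form: head -> constraint, body[0], ..., body[n-1]."""
    head: Atom
    constraint: str  # kept opaque here; a real solver would parse it
    body: Tuple[Atom, ...]


def well_formed(rule: Rule) -> bool:
    """Check that no variable occurs twice across the tuples of the rule."""
    seen = [v for atom in (rule.head, *rule.body) for v in atom.variables]
    return len(seen) == len(set(seen))


def context_free_projection(rule: Rule) -> Tuple[str, Tuple[str, ...]]:
    """Drop the tuples and the constraint, keeping only the symbols."""
    return rule.head.symbol, tuple(atom.symbol for atom in rule.body)


# Hypothetical DCG-like rule: s(X0) -> X0 = s(X1, X2), np(X1), vp(X2)
rule = Rule(
    head=Atom("s", ("X0",)),
    constraint="X0 = s(X1, X2)",
    body=(Atom("np", ("X1",)), Atom("vp", ("X2",))),
)
```

On the sample rule, context_free_projection keeps the symbol s and the right-hand side (np, vp), which is the rule a context-free parser would use as a skeleton. 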
Most parsing techniques for Constraint-Based Grammars use the underlying context-free structure to guide parsing. This allows the reuse of context-free parsing techniques. The grammars we have just defined encompass several classes of grammars used in NLP, including logic grammars (Definite Clause Grammars and variants), Unification Grammars, Tree Adjoining Grammars (1) and, at least partially, Lexical-Functional Grammars and Head Phrase Structure Grammars. Of course, there are syntactical differences between these classes and Constraint-Based Grammars. A simple translation from one syntax to the other is necessary.</Paragraph> </Section> <Section position="4" start_page="454" end_page="456" type="metho"> <SectionTitle> 3 A GENERAL FRAMEWORK FOR PARSING </SectionTitle> <Paragraph position="0"> This section is devoted to a general framework for parsing in which most of the parsing methods, including all the most common ones, are expressible. It is an extension of a context-free framework [Lan74]. It is based on an explicit separation between the parsing strategy, which describes how syntax trees are built (e.g. top-down, bottom-up), and the control strategy, which deals with the non-determinism of the parsing (e.g. backtracking, tabulation).</Paragraph> <Paragraph position="1"> (1) TAGs have an underlying context-free structure, although this is not obvious in their formal definition. See for instance [Lan91].</Paragraph> <Section position="1" start_page="455" end_page="455" type="sub_section"> <SectionTitle> 3.1 EPDAs </SectionTitle> <Paragraph position="0"> This separation is based on an intermediate representation that describes how a grammar is used following a given parsing strategy. This intermediate representation is a Push-Down Automaton. 
It is known that most context-free parsers can be encoded with such a stack machine. Of course, the usual formalism has to be extended to take constraints into account, and possibly use them to disambiguate the parsing. We call the extended formalism Extended Push-Down Automaton (EPDA).</Paragraph> <Paragraph position="1"> For lack of space, we do not give here the formal definition of EPDA. Informally, it is a machine using three data structures: a stack containing at each level a stack symbol and its tuple of variables; a representation of the terminal string that distinguishes the terminals that have already been used from those that are still to be read; finally a constraint. A configuration of an automaton is a triple made of these three data. Transitions are partial functions from configurations to configurations. We add some restrictions to these transitions: the only change allowed for the string is that at most one more terminal is read; only the top of the stack is accessible and at most one symbol can be added to or removed from it at once.</Paragraph> <Paragraph position="2"> These restrictions are needed to employ directly the generic tabular techniques for automata execution described in [BVdlC92]. EPDAs may be non-deterministic, i.e. several transitions may be applicable to a given configuration.</Paragraph> <Paragraph position="3"> [...] blends two tasks: * The structural part, which consists in building the skeleton of parse trees. This part is similar to context-free parsing with the underlying context-free projection of the grammar.</Paragraph> <Paragraph position="4"> * Solving the constraints of this skeleton.</Paragraph> <Paragraph position="5"> The two tasks are related in the following way: constraints appear at the nodes of the tree; the structure is not a valid syntax tree if the constraint set is unsatisfiable. Each task can be performed in several ways: there are several context-free parsing methods (e.g. 
LL, LR) and constraint sets can be solved globally or incrementally, using various orders, and several ways of mixing the two tasks are valid. Tree construction involves a stack mechanism, and constraint solving results in a constraint. The different parsing techniques can be described as computations on these two data structures. EPDAs are thus able to encode various parsers for Constraint-Based Grammars. Automatic translation of grammars into EPDAs is possible using extensions of usual context-free techniques [Bar93].</Paragraph> </Section> <Section position="2" start_page="455" end_page="456" type="sub_section"> <SectionTitle> 3.2 ARCHITECTURE </SectionTitle> <Paragraph position="0"> Thanks to the intermediate representation (EPDA), parsing can be divided into two independent passes: the compilation, which translates a grammar into an extended automaton; the execution, which takes an EPDA and a string and produces a forest of syntax trees. To achieve this independence, the compiler is not allowed to make any assumptions about the way the automata it produces will be executed, and the interpreter in charge of the execution is not allowed to make assumptions about the automata it executes.</Paragraph> <Paragraph position="1"> To this scheme, reused from context-free parsing, we add a third component: the solver (in a broad sense), in charge of all the operations related to constraints and variables. We try to make it as independent from the other two modules (compiler and interpreter) as possible.</Paragraph> <Paragraph position="2"> The independence is not complete, since both the compiler and the interpreter involve constraints and related operations, which are performed by the solver. We just want to define a clear interface between the solver and the other modules, an interface independent of the kind of constraints and of the solving algorithms being used. The same compiler (resp. 
interpreter) used with different solvers will work on different classes of grammars. For instance, the same compiler can compile Unification Grammars and Definite Clause Grammars, using two solvers, one implementing feature unification, the second one implementing first-order unification.</Paragraph> <Paragraph position="3"> We can see a complete parsing system as the combination of three modules: compiler, interpreter, solver. When each module has several implementations, we would like to be able to take any combination of three modules. This schematic abstraction captures the parsing algorithms we are interested in. However, actually defining interfaces for a practical system without restricting open-endedness or the abstraction (interchangeability of components) was the most difficult technical task of this work.</Paragraph> </Section> <Section position="3" start_page="456" end_page="456" type="sub_section"> <SectionTitle> 3.3 SOLVERS </SectionTitle> <Paragraph position="0"> The main problem lies in the definition of the solver's interface. Some of the required operations are obvious: renaming of constraints and tuples, constraint building, extraction of the variables from a constraint, etc.</Paragraph> <Paragraph position="1"> Note, incidentally, that constraint solving can be hidden within the solver, and thus not appear in the interface. There is an equivalence relation between constraints given by their interpretations. This relation can be used to replace a constraint by another equivalent one, possibly simpler. The solving can also be explicitly used to enforce the simplification of constraints at some points of the parsing.</Paragraph> <Paragraph position="2"> Unfortunately some special techniques require more specific operations on constraints. For instance, a family of parsing strategies related to Earley's algorithm makes use of the restriction operator defined by Shieber in [Shi85]. 
Another example: some tabular techniques benefit from a projection operator that restricts constraints with respect to a subset of their variables.</Paragraph> <Paragraph position="3"> We could define the solver's interface as the cartesian product of all the operations used by at least one technique. There are two reasons to reject such an approach. The first one is that some seldom used operations are difficult to define on some constraint domains; this is the case, among others, of the projection. The second reason is that it would restrict us to the techniques already existing and known to us at the moment when we design the interface. This contradicts the open-endedness requirement. A new operation can appear, useful for a new parsing method or for optimizing the old ones.</Paragraph> <Paragraph position="4"> We prefer a flexible definition of the interface.</Paragraph> <Paragraph position="5"> Instead of defining one single interface, we allow each alternative implementation of the solver to define exactly what it offers and each implementation of the compiler or of the interpreter to define what it demands. The combination of modules involves checking that the offer encompasses the demand, i.e. that all the needed operations are implemented. This imposes restrictions on the combination of modules: this is the overhead incurred to obtain an open-ended system, open to new developments.</Paragraph> <Paragraph position="6"> We found a language providing the kind of flexible modularity we needed: Alcool-90. We now present this language.</Paragraph> </Section> </Section> <Section position="5" start_page="456" end_page="457" type="metho"> <SectionTitle> 4 THE LANGUAGE ALCOOL 90 </SectionTitle> <Paragraph position="0"> Alcool-90 is an experimental extension of the functional language ML with run-time overloading [Rou90]. 
Overloading is used as a tool for seamless integration of abstract data types in the ML type system, retaining strong typing and type inference properties. Abstract data types (encapsulating a data structure representation and its constructors and interpretive functions) provide values for overloaded symbols, as classes provide methods for messages in object-oriented terminology. However, strong typing means that the compiler guarantees that errors of the kind &quot;method not found&quot; never happen.</Paragraph> <Paragraph position="1"> Abstract programs are programs referring to overloaded symbols, whose values will be determined at run-time, consistently with the calling environment. By grouping abstract programs, we obtain parameterized abstract data types (or functors), the calling environment being here a particular instantiation of the parameterized adt.</Paragraph> <Paragraph position="2"> Thus, we obtain an environment equivalent to a module system, each module being an adt, possibly parameterized.</Paragraph> <Paragraph position="3"> For instance, in APOC-II, compilers have an abstract data type parameterized by a solver.</Paragraph> <Paragraph position="4"> Alcool-90 also proposes an innovative environment where we exploit ambiguities due to overloading for semi-automated program configuration: the type inference computes interfaces of &quot;missing&quot; components to complete a program, according to the use of overloaded symbols in the program. A search algorithm finds components satisfying those interfaces, possibly by finding suitable parameters for parameterized components. 
Naturally, instantiation of parameterized components is also type-safe: actual parameters must have interfaces matching formal parameters (schematically: the actual parameter must provide at least the functions required by the interface of the formal parameter).</Paragraph> <Paragraph position="5"> For instance, only the solvers providing Shieber's restriction can be used as the actual parameter of the Earley with restriction compiler. But these solvers can also be used by all the compilers that do not use the restriction.</Paragraph> <Paragraph position="6"> Simple module systems have severe limitations when several implementations of components with similar interfaces coexist in a system, or when some component may be employed in different contexts. Ada generics provided a first step to module parameterization, though at the cost of heavy declarations and difficulties with type equivalence. SML proposes a very powerful module system with parameterization, but lacks separate compilation and still requires a large amount of user declarations to define and use functors.</Paragraph> <Paragraph position="7"> Object-oriented languages lack the type security that Alcool-90 guarantees.</Paragraph> <Paragraph position="8"> The Alcool-90 approach benefits from the simplification of modules as abstract data types by adding inference facilities: the compiler is able to infer the interfaces of parameters required by a module. Moreover, the instantiation of a functor is simply seen as a type application, thus no effort is required from the programmer, while its consistency is checked by the compiler.</Paragraph> <Paragraph position="9"> This approach is mostly useful when multiple implementations with similar interfaces are available, whether they coexist in the program or are used to generate several configurations. Components may have similar interfaces but different semantics, although they are interchangeable. 
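The matching between what a component requires and what candidate components provide can be pictured with the following Python sketch. Alcool-90 performs this check statically, through type inference; the sketch only imitates it at run-time with set inclusion. The operation names and the module names are hypothetical and do not describe the actual APOC-II library.

```python
from itertools import product

# Hypothetical solvers: each one declares the operations it offers.
SOLVERS = {
    "solver_with_restriction": {"rename", "build", "variables", "restrict"},
    "solver_without_restriction": {"rename", "build", "variables"},
}

# Hypothetical compilers and interpreters: each one declares what it
# demands from the solver it is combined with.
COMPILERS = {
    "top_down": {"rename", "build"},
    "earley_with_restriction": {"rename", "build", "restrict"},
}
INTERPRETERS = {
    "backtracking": {"rename"},
    "tabulation": {"rename", "variables"},
}


def compatible(offer, *demands):
    """A combination is valid when the offer includes every demand."""
    return all(demand.issubset(offer) for demand in demands)


def configurations():
    """Enumerate the (compiler, interpreter, solver) triples that fit."""
    return sorted(
        (c, i, s)
        for (c, c_demand), (i, i_demand), (s, offer) in product(
            COMPILERS.items(), INTERPRETERS.items(), SOLVERS.items()
        )
        if compatible(offer, c_demand, i_demand)
    )
```

With these hypothetical declarations, six of the eight possible triples are accepted; the two rejected ones pair the compiler that demands restriction with the solver that does not offer it. 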
Choosing a configuration is simply choosing from a set of solutions to missing components, computed by the compiler.</Paragraph> <Paragraph position="10"> Several other features of Alcool-90 have not been used in this experiment, namely the inheritance operator on abstract data types, and an extension of the type system with dynamics (where some type checking occurs at run-time).</Paragraph> </Section> <Section position="6" start_page="457" end_page="458" type="metho"> <SectionTitle> 5 APOC-II </SectionTitle> <Paragraph position="0"> APOC-II is a system written in Alcool-90, implementing numerous parsing techniques within the framework described in section 3. The user can choose between these techniques to build a parser.</Paragraph> <Paragraph position="1"> By adding new modules written in Alcool-90 to the library, new techniques can freely be added to the system.</Paragraph> <Paragraph position="2"> APOC-II has two levels of modularity: the first one is that of the three main components distinguished above, compiler, interpreter and solver. Each of these components is implemented by several alternative modules, which are combinable using the Alcool-90 discipline.</Paragraph> <Paragraph position="3"> The second level of modularity consists in splitting each of the three main components into several modules. This makes the sharing of common parts of different implementations possible.</Paragraph> <Paragraph position="4"> We now give examples of the splitting APOC-II uses at the moment, in order to give an idea of this second level of modularity. 
This splitting has proved convenient so far, but it is not fixed or imposed on further developments: a new implementation can be added even if it uses a completely different internal structure.</Paragraph> <Paragraph position="5"> A solver is made of: * a module for variables, variable generation and renaming, * a parser for constraints, * a pretty-printer for constraints, * a constraint builder (creation of abstract syntax trees for constraints, e.g. building constraints expressing equality of variables), * a solver in the restrictive meaning, in charge of constraint reduction, * an interface that encapsulates all the other modules.</Paragraph> <Paragraph position="6"> A compiler includes: * a grammar parser (that uses the constraint parser given by the solver), [...]ing the &quot;compile&quot; function, the only one exported. The two interpreters implemented so far have very different structures. The first one uses backtracking and the second one uses tabulation. They share some modules however, such as a module handling transitions and a lexer of input strings.</Paragraph> <Paragraph position="7"> The interest of the modular architecture is in the combinatorial effect of module composition. It leads to many different parsing algorithms. Figure 1 summarizes the different aspects of the parsing algorithms that can vary more or less independently.</Paragraph> <Paragraph position="8"> For example, the built-in parsing method of Prolog for DCGs is obtained by combining the solver for DCGs, the top-down strategy, 0 symbols of look-ahead and a backtracking interpreter (and other modules not mentioned in figure 1 because they do not change the algorithm, but at most its implementation).</Paragraph> <Paragraph position="9"> Some remarks about figure 1: * we call Earley parsing strategy the way Earley deduction [PW83] builds a tree, not the control method it uses. It differs from top-down by the way constraints are taken into account. 
* the difference between Earley-like tabulation and graph-structured stacks is the data structure used for item storage. Several variants are possible, which actually change the parser's behavior.</Paragraph> <Paragraph position="10"> Modules written in bold font are already implemented, whereas modules written in italic are possible extensions to the system.</Paragraph> <Paragraph position="11"> * we call synchronization a kind of breadth-first search where scanning a terminal is performed only when it is needed by all the paths of the search tree. The search is synchronized with the input string. It is the order used by Earley's algorithm.</Paragraph> <Paragraph position="12"> * at the moment, only generic look-ahead, that is look-ahead based on the first and follow sets, has been considered. Some more accurate look-ahead techniques, such as the ones involved in SLR(k) parsing, are probably not independent from the parsing strategy and cannot be an independent module.</Paragraph> <Paragraph position="13"> Building a parsing system with APOC-II consists roughly in choosing one module from each row of figure 1 and combining them. Some of the combinations are not possible. Thanks to type checking, Alcool-90 will detect the incompatibility and provide a type-based explanation of the problem.</Paragraph> <Paragraph position="14"> At the moment, APOC-II offers more than 60 different parsing algorithms. Given a grammar, there is a choice of more than 20 different parsers. Adding one module does not add only one more algorithm, but several new variants.</Paragraph> <Paragraph position="15"> The techniques implemented by APOC-II are not original. For instance, the LR compilation strategy comes from a paper by Nilsson [Nil86], and left-corner parsing has been used by Matsumoto and Tanaka in [MT83]. As far as we know, however, LR and left-corner parsers have not been proposed for Tree-Adjoining Grammars before. 
Notice that the modularity is also useful to vary the implementation of algorithms. For instance, a first prototype can be quickly written by implementing constraint reduction in a naive way. A refined version can be written later, if needed.</Paragraph> </Section> </Paper>