<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2304"> <Title>A Robust and Efficient Parser for Non-Canonical Inputs</Title> <Section position="3" start_page="19" end_page="20" type="metho"> <SectionTitle> 2 Property Grammars: a constraint-based formalism </SectionTitle> <Paragraph position="0"> We present in this section the formalism of Property Grammars (see [Bes99] for preliminary ideas, and [Blache00], [Blache05] for a presentation). The main characteristic of Property Grammars (hereafter PG) is that all information is represented by means of constraints. Moreover, grammaticality is not the core question but becomes a side effect of a more general notion called characterization: an input is not associated with a syntactic structure, but described by its syntactic properties.</Paragraph> <Paragraph position="1"> PG makes it possible to represent syntactic information in a decentralized way and at different levels. Instead of using sub-trees as in classical generative approaches, PG directly specifies constraints on features, categories or sets of categories, independently of the structure to which they are supposed to belong. This characteristic is fundamental in dealing with partial, underspecified or non-canonical data. It is then possible to stipulate relations between two objects, independently of their position in the input or in a structure. The syntactic properties of an input can then be described very precisely, including in the case of non-canonical or ungrammatical input. In the remainder of the section, we give a brief overview of PG characteristics. All syntactic information is represented in PG by means of constraints (also called properties).</Paragraph> <Paragraph position="2"> They stipulate different kinds of relations between categories, such as linear precedence, imperative co-occurrence, dependency, repetition, etc. There is a limited number of types of properties. 
In the technique described here, we use the following ones: - Linear precedence: Det < N (a determiner precedes the noun) - Dependency: AP - N (an adjectival phrase depends on the noun) - Requirement: V[inf] = to (an infinitive comes with to) - Exclusion: seems [?] ThatClause[subj] (the verb seems cannot have that-clause subjects) - Uniqueness: UniqNP{Det} (the determiner is unique in an NP) - Obligation: ObligNP{N, Pro} (a pronoun or a noun is mandatory in an NP) This list can be extended according to need or to the language to be parsed. In this formalism, a category, whatever its level, is described by a set of properties, all of them being at the same level and none having to be verified before another. Parsing a sentence in PG consists in verifying, for each category, the set of corresponding properties in the grammar. More precisely, the idea is to verify, for each subset of constituents, its relevant constraints (i.e. the ones applying to the elements of the subset). Some of these properties may be satisfied, others violated. The result of this evaluation, for a category, is a set of properties together with their evaluation. We call such a set the characterization of the category.</Paragraph> <Paragraph position="3"> Such an approach makes it possible to describe any kind of input.</Paragraph> <Paragraph position="4"> Such flexibility, however, has a cost: parsing in PG is exponential (cf. [VanRullen05]). This complexity comes from several sources. First, this approach makes it possible to consider any category, independently of its position in the input, as a possible constituent of another category. This makes it possible, for example, to take into account long-distance or non-projective dependencies between two units.</Paragraph> <Paragraph position="5"> Moreover, parsing non-canonical utterances relies on the possibility of building characterizations with both satisfied and violated constraints. 
In terms of implementation, since a property is a constraint, this requires a constraint relaxation technique. Constraint relaxation and discontinuity are the main complexity factors of the PG parsing problem. The technique described in the next section is designed to control these aspects.</Paragraph> </Section> <Section position="4" start_page="20" end_page="21" type="metho"> <SectionTitle> 3 Parsing in PG </SectionTitle> <Paragraph position="0"> Before describing the controlled parsing technique proposed here, we first present the general parsing schema in PG. The process consists in building the list of all possible sets of categories that are potential constituents of a syntactic unit (also called a construction). A characterization is built for each of these sets. Insofar as constructions can be discontinuous, it is necessary to build all possible combinations of categories, in other words, the set of all subsets of the categories corresponding to the input to be parsed, starting from the lexical categories. We call such a subset an assignment. All assignments then have, theoretically, to be evaluated with respect to the grammar. This means, for each assignment, traversing the constraint system and evaluating all relevant constraints (i.e. constraints involving categories belonging to the assignment).</Paragraph> <Paragraph position="1"> For some assignments, no property is relevant and the corresponding characterization is the empty set: we say in this case that the assignment is non-productive. In other cases, the characterization is formed from all the evaluated properties, whatever their status (satisfied or not). 
At the first stage, all constructions contain only lexical categories, as in the following example: An assignment with a productive characterization entails the instantiation of the construction as a new category, which is added to the set of categories.</Paragraph> <Paragraph position="2"> In the previous examples, AP and NP are then added to the initial set of lexical categories. A new set of assignments is then built, including these new categories as possible constituents, making it possible to identify new constructions.</Paragraph> <Paragraph position="3"> This general mechanism can be summarized as follows: Initialization ∀ word at a position i: create the set ci of its possible Until new characterization are built This parsing process highlights the complexity arising from the number of assignments to be taken into account: this set has to be rebuilt at each step (i.e. whenever a new construction is added).</Paragraph> <Paragraph position="4"> As explained above, each assignment has to be evaluated. This process amounts to building a characterization formed by the set of its relevant properties. A property p is relevant for an assignment A when A contains categories involved in the evaluation of p. In the case of unary properties constraining a category c, the relevance is directly known. In the case of n-ary properties, the situation differs for positive and negative properties. The former (e.g. co-occurrence constraints) concern two realized categories. In this case, c1 and c2 being these categories, we have {c1, c2} ⊆ A. In the case of negative properties (e.g. co-occurrence restrictions), we need to have either c1 ∈ A or c2 ∈ A.</Paragraph> <Paragraph position="5"> When a property is relevant for a given A, its satisfiability is evaluated according to the property semantics, each property being associated with a solver. The general process is described as follows: Let G be the set of properties in the grammar, let A be an assignment ∀ pi ∈ 
G, if pi is relevant: evaluate the satisfiability of pi for A; add pi and its evaluation to the characterization C of A. Check whether C is productive. In this process, for all assignments, all properties have to be checked to verify their relevance and, where relevant, their satisfiability.</Paragraph> <Paragraph position="6"> The last aspect of this general process concerns the evaluation of the productivity of the characterization of an assignment. A productive assignment makes it possible to instantiate the corresponding category and to consider it as realized. A characterization is obviously productive when all properties are satisfied. But it is also possible to consider an assignment as productive when its characterization contains violated properties. It is then possible to build categories, or more generally constructions, even for non-canonical forms. In this case, the characterization is not entirely positive. This process has to be controlled. The basic control consists in setting a threshold on the number of violated constraints. It is also possible to be more precise and to impose a hierarchy on the constraint system: some types of constraints or some individual constraints can play a more important role than others (cf. [Blache05b]).</Paragraph> <Paragraph position="7"> A controlled version of this parsing schema, implemented in the experiment described in the next section, takes advantage of the general framework, in particular in terms of robustness implemented as constraint relaxation. The process is, however, controlled during the construction of the assignments.</Paragraph> <Paragraph position="8"> This control process relies on a left-corner strategy, adapted to the PG parsing schema. This strategy consists in identifying whether a category can start a new phrase. It makes it possible to drastically reduce the number of assignments and thus to control ambiguity. Moreover, the left corner suggests a construction label. 
The set of properties taken into consideration when building the characterization is then reduced to the set of properties corresponding to the label. These two controls, plus a disambiguation of the lexical level by means of an adapted POS tagger, make the parsing process very efficient.</Paragraph> <Paragraph position="9"> The left-corner process relies on a precedence table, calculated for each category according to the precedence properties in the grammar. This table is built automatically by verifying, for each category, whether it can precede all the other categories of a given construction. The process consists in verifying that the category is not a right member of a precedence property of the construction, i.e. that no other category is required to precede it. If so, the category is said to be a possible left corner of the construction. The precedence table then contains, for each category, the labels of the constructions for which it can be a left corner.</Paragraph> <Paragraph position="10"> During the process, when a category ci is a potential left corner of a construction C, we verify that C is not the last construction opened by a left corner. If it is not, a new left corner is identified, and C is added to the set of possible constituents (usable by other assignments). Moreover, the characterization of the assignment beginning with ci is built by verifying the subset of properties describing C.</Paragraph> <Paragraph position="11"> The generation of the assignments can also be controlled by means of a co-constituency table.</Paragraph> <Paragraph position="12"> For each category, this table lists all the categories with which it appears in a positive property. This table is easily built with a simple traversal of the constraint system. 
Adding a new category ci to an assignment A is possible only when ci appears as a co-constituent of a category belonging to A.</Paragraph> <Paragraph position="13"> S: initial set of lexical categories. Identify all the left corners. For all C, construction opened by a left corner ci, with G' the set of properties describing C: build assignments beginning with ci; build characterizations verifying G'. The parsing mechanism described here takes advantage of the robustness of PG. Any kind of input, whatever its form, can be parsed because of the possibility of relaxing constraints. Moreover, the control technique makes it possible to reduce the complexity of the process without modifying its philosophy.</Paragraph> </Section> </Paper>
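The general evaluation step of Section 3 (relevance test, evaluation by a solver, characterization, productivity check with a violation threshold) can be illustrated with a minimal Python sketch. It is not the implementation used in the paper. The names `Property`, `is_relevant`, `characterize`, `is_productive` and `max_violations` are hypothetical, categories are reduced to plain labels without positions, and the two-property toy grammar only borrows the dependency and exclusion examples of Section 2 with deliberately simplified solvers.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

# Assumption: an assignment is a set of category labels; positions are ignored here.
Assignment = FrozenSet[str]


@dataclass(frozen=True)
class Property:
    """Hypothetical container for a binary property over categories c1 and c2."""
    name: str
    c1: str
    c2: str
    positive: bool                        # positive vs. negative property
    solver: Callable[[Assignment], bool]  # returns True when the property is satisfied


def is_relevant(p: Property, a: Assignment) -> bool:
    """Relevance test: positive needs {c1, c2} included in A, negative needs c1 or c2 in A."""
    if p.positive:
        return {p.c1, p.c2}.issubset(a)
    return p.c1 in a or p.c2 in a


def characterize(grammar: List[Property], a: Assignment) -> List[Tuple[str, bool]]:
    """Evaluate every relevant property and keep it together with its status."""
    return [(p.name, p.solver(a)) for p in grammar if is_relevant(p, a)]


def is_productive(characterization: List[Tuple[str, bool]], max_violations: int = 0) -> bool:
    """Assumed control: non-empty characterization, at most max_violations violated properties."""
    violated = sum(1 for _, ok in characterization if not ok)
    return bool(characterization) and max_violations >= violated


# Toy grammar (illustrative only): one dependency and one exclusion.
GRAMMAR = [
    # Simplification: the dependency counts as satisfied once both categories are realized.
    Property("dep(AP,N)", "AP", "N", True, lambda a: True),
    # The exclusion is violated when both categories are realized together.
    Property("excl(seems,ThatClause)", "seems", "ThatClause", False,
             lambda a: not {"seems", "ThatClause"}.issubset(a)),
]

if __name__ == "__main__":
    for cats in ({"AP", "N"}, {"seems", "ThatClause"}, {"Det"}):
        c = characterize(GRAMMAR, frozenset(cats))
        print(sorted(cats), c, is_productive(c), is_productive(c, max_violations=1))
```

In this sketch, raising `max_violations` corresponds to the basic relaxation control described above: an assignment whose only relevant property is violated is non-productive with the default threshold and becomes productive once one violation is tolerated. An assignment with no relevant property stays non-productive whatever the threshold.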