<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-4187">
  <Title>First Results of a French Linguistic Development Environment</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction: EGL
</SectionTitle>
    <Paragraph position="0"> The EGL (Enviroimement de Gdnie Linguistique) project started in 1989, with the proposal to create a linguistic software development environment containing a computational treatment of 1;'leach grmmltarJ Its three main objectives were to allow research groups working in NLP: m to develop and test both general l'Yencb graamtmrs and specific linguistic anMyses for that bmguage, * to test new parsers mtd to compare several parsers in a uniform setting, and * to have at their disposal an ~malyzer/generator for French, easy to maim rain and to port to other domains.</Paragraph>
    <Paragraph position="1"> tThe EGL project involves 6 different partners:  Machine. Development of the GPSG grammar of bYeneh was also supported by grants from the SSRC of Canada (grant #410-89-1469) and the FCAR of Quebec (grants #89-EQ-4213 and #92-ER-1198).</Paragraph>
    <Paragraph position="2"> Independently of a particular application, the envirolmmnt must be usable both as a component in a system making use of an existing syntactic database, aald as a development environment for new syntactic treatments of the language. The first phase of the EGL project was partly based oil a critical evaluation of existing work (in particular GDE \[1\]), eatd defined a general architecture with the following modules:  The initial grmmitatical formalism chosen was that of unification-based gralmnar and three main linguistic frameworks are taken into accomlt in F~G~: GPSG \[11\], Lt&amp;quot;G \[16\] and FUG \[17\]). The parser is based on tile general principle of a chart; different attalyzers for the tliffereut forxmdisms can be integrated into the system by making retereuce to that model and by including specitie nlethods for the types of objects they tmmipulate. Tile basic anMyzer is a revised version of the GDF parser \[8\]; two LFG parsers are being iutegrated, and a FUG parser is planned.</Paragraph>
    <Paragraph position="3"> The French test-suite and the grarmnax are both already fairly well developed. The basic gramumr provided with the envirormaent is the keystone of the whole system. It allows using the environment directly ~md without further work, sam also serves as a testbench for the computational solutions to liugtfistic problelrLs. '\]?he test-suite serves as a guideline for tile coverage of (system-provided or user-defined) grarmnars, to test whether they accept an independently established corpus of written sentences which exemplify the nmiu linguistic problems anti phenonmna of the language.</Paragraph>
    <Paragraph position="4"> ACRES DE CO1JNG-92, NANqE.S. 23-28 not'yr 1992 1 I 7 7 l'aoc, ov COLING-92. NANTES, AUG. 23-28. 1992 Wtfile defining a French lexicon was not one of the main objectives of the project, having a lexicon is an mmvoidable requirement for testing grammars and analyzers and the treatment of lexical information became an important coinponent of the work. The need to access a single lexicon required a study of the normalization of lexical information which led to interesting questions about the reusability of syntactic features. Detining development management tools turned out to pose challenging theoretical problems. The History component keeps track of grammar development and modification, and is complementary to the Coherence component which validates a state of the grasmttar. The Generation component allows the linguist to test limit cases in the grammar, both from tile point of view of analysis complexity and in order to check overgeneration.</Paragraph>
    <Paragraph position="5"> We start our description with the module making the system usable as a development tool for linguistic software, i.e. the set of graptfical utilities for the visual representation of tile grammar, the analysis process and the results.</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 User environment
</SectionTitle>
    <Paragraph position="0"> EGL lets the user parazneterize execution and control commands, explore their results, amt visualize and edit lexicai and syntactic knowledge.</Paragraph>
    <Paragraph position="1"> In contrast with earlier approadles such as \[4\], we tlfink that user interface standards are now sufficiently ilmture to allow reasonably portable software to be developed, and most of these frmctions are part of a graphical user interface running under X-window Motif. The EGL graphical user interface is best illustrated with the parsing tools, wtfich are directed towards both the greanmar developer and the parser developer. The user can select a sentence, control parser execution, mtd explore the results. During parsing, the user can display the chart and watch it evolve dynanfically. The agenda of awaiting chart tasks can also be displayed and manipulated. Tiffs allows the parser developer to e~cperiment mannally with chart parsing strategies before integrating them into the parser.</Paragraph>
    <Paragraph position="2"> After parsing~ the grammar developer can display the relevant structures (derivation trees, feature structures, rules used, etc.) and navigate through them. The whole user interface behaves as a structure inspector, or hypertext-style browser, with displays and limks tailored to the linguistic needs and habits of ti~e user.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Development Management Tools
</SectionTitle>
    <Paragraph position="0"> Besides the test suite elaborated for the project, three validation tools contribute to grammar development: the tIistury, Coiterenee and Generation components. As the test suite and the ftistory components are described in detail elsewhere \[5\], we will spend more time on the Coherence and Generation components. They are both based upon a formalism which is common to GPSG, LFG and FUG, and thus able to include all tile data and constraints of those three frameworks. In this way, EGL goes beyond previous projects such as \[8, 7\] and provides a common tool for various frameworks.</Paragraph>
    <Paragraph position="1"> A gT~mmar consists of four sets (category, (ID-)rule, LP-rule and metarule). 2 Each set includes both data and principles. A principle is a constraint that must apply everywhere mid which defines the admissible data.</Paragraph>
    <Paragraph position="2"> A category (I, F, A) is represented as: 3 e A categorial identifier I, which is a symbol identifying the category.</Paragraph>
    <Paragraph position="3"> A formula/~', which defines constraints applicable to the category. These are deduced from the rule that generated the category, or from principles. The allowed predicates are: standard D, constrained D~, default 3d deduction; standard --, constrained =%, default -a ttnification; negation -,, ration /', and disjtmction Y.</Paragraph>
    <Paragraph position="4"> * An attribute-wdue structure A. A value may be atomic or complex (itself an attribute-value structure). It can be dedared explicitly (with constants) or implicitly (referring to another value in the structure, thus allowing data sharing).</Paragraph>
    <Paragraph position="5"> Local trees stem fronl rewrite rules, 4 constrained by LP-rules and principles, s The precedence constraints can be mentioned in the right-hand side of a rule inside the rule as well as a principle via precedence rules. This expressive power ('allowing &amp;quot;formalism mixing&amp;quot;) facilitates  AcrEs DE COL1NG-92, NA/VI~..S. 23-28 AOt~l 1992 1 1 7 8 PRoC. ov COLING-92. NAN'rJ~s. AUG. 23-28, 1992 grannnax development. Two exaauples:</Paragraph>
    <Paragraph position="7"> The m,'dn protdem in the Coherence coinponent is that of salisfiabilit~, ls there any valid parse with the user's graznmar? Besides satisfiability, some questions are of great interest from a linguistic point of view, e.g. sufficiency and necessity of all the data. A grammar must be structurally coherent, and we say that a grarnmar is coherent iff it satisfies: o non-cyclicity: there is no cyclic point.</Paragraph>
    <Paragraph position="8"> , non-redtmdancy: A is redmidant w.r.t. B in a grammar S iff S-A has the stone strong generative capacity as S-B.</Paragraph>
    <Paragraph position="9"> non-superffifity: A is superfluous in S iff S aml S-A have the same strong generative capacity.</Paragraph>
    <Paragraph position="10"> accessibilJty-coaccessibility: data is accessible (resp. coarcessible) iff used at least once in generation (resp. a parse).</Paragraph>
    <Paragraph position="11"> We have shown 12\] that cyclicity, redundancy ,'rod superfluity are subproblems of accessibility: an accessibility algorittun can be used as a necessary condition for the three other problems, lit a context-free granmlar, linguistic coherence can be tested locally. Therefore, a first pass applies to a context-free paxt of the grarunmr (without data shaxing nor nonmonotonic atonfic formulas). A second, global, pass uses label propagation, where labels are defined by constraints. We are also investigating a clique method to treat accessibility in a trartahle way \[9, 2\].</Paragraph>
    <Paragraph position="12"> The inputs to the Generation component are the following constraints: s on the graummr: specification of obligatory, forbidden or cooccurrent rules, * on ternfinal nodes: specification of complex structures that deternfine terminal nodes types, * on iuitial structures: specification of incomplete parse trees.</Paragraph>
    <Paragraph position="13"> These parameterizations were easily included into tlm formalisnt, but problems occur with tire algorithm itself, which chart Mgoritlmls are insufficient to deal with. Three agendas take care of post-modification of nodes in incomplete trees, thus extending Slffeber's algorithm \[21, 18\].</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Linguistic Descriptions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Grammar
</SectionTitle>
      <Paragraph position="0"> &amp;quot;\];he development of tim GPSG granunax for 1,~rench cau be traced through three steps.</Paragraph>
      <Paragraph position="1"> First, we implemented a demonstration groam mar \[12\], patterned alter tile English granunar described in the GDF, User ManuM \[8\]. In terms of coverage, tiffs French grammar cau handle some simple questions, wtfich required the definition of two additional nmtarules, ht terms of gralmnar-writing style, following a suggestion of \[22, pp. 115-t19\], we detine the person feature in temps of two tffnary featm'es, EGO said PTC (participant). Finally, agreement is a nmch more pervasive phenontenon in French than in English, and ntaaly more eases nmst be taken into arcomit: adjective/noun, determiner/noun, adjectival predicate, arid the past participle.</Paragraph>
      <Paragraph position="2"> As a second step, we developed a GPSG-based I,'rend~ grauunax &amp;quot;.along the lines of the \]:~iIglish gratnnmr described in \[15\]. Although the linguistic coverage is sinfflax in both of them, the l'arench graumlaX is only loosely patterned after the Enghsh one.. Its development was broken into subtasks according to the types of constituents encountered (AI', NP, VP ...) as well as to the types of specific linguistic problems to be accounted fl~r (e.g. agreement, comparatives and coordination), lu generM, the rides in our graxmuax axe driven by lexicM infornmtion: we ttms model our computational grammax on tim results of current linguistic theory.</Paragraph>
      <Paragraph position="3"> Our treatment of agreement is fairly complete \[13\]. For example, we can handle complex color adjectives (des robes vert bouteille, &amp;quot;bottle-green dresses&amp;quot;), predicate APs (los robes sont reties, &amp;quot;the dresses are green&amp;quot;), mid past participles (les dtudiantes que les policiers out matraqu~es, &amp;quot;the students that the police beat up&amp;quot;).</Paragraph>
      <Paragraph position="4"> Tim treatment of VPs is extensive \[14\] attd includes the positioning of clitics \[3\] and of negation. l,exical VI iteius are used to handle complex tenses ,~Ld the positioning of negation mid certain adverbs. We strived to ndnhiffze the nuntber of lexical II)-rules and tackle tim problem of &amp;quot;categoriM distortion&amp;quot; \[20\] (in particular, the granunar ca:u account tor complement sub-categorization alternations in a systematic way).</Paragraph>
      <Paragraph position="5"> The treatment of 1qPs was found to cause At:IEs DE COLING-92, NAtCI'ES. 23-28 AOt~l 1992 1 1 7 9 PROC. oF COLING-92. NANIT.S. Auo. 23-28. 1992 more serious problems. Although we were able to pattern our treatment of modifiers after \[15\], that of specifiers is more problematic \[19\]. It has rapidly become clear that semantic information is necessary for a satisfactory solution. Thus, the third step is to enrich our morpho-syntactic grammar with a semantic component \[6\].</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Lexicon
</SectionTitle>
      <Paragraph position="0"> A lexical database is obviously necessary to perform any test on gramm_,3rs and parsers.</Paragraph>
      <Paragraph position="1"> Defining a French lexicon within the GPSG forrealism was not one of our goals but, in parallel to the syntactic database, we had to construct a lexicon couched in a formalism compatible with different grammars and with enough coverage to be useful. Like the grammar provided with the environment s this lexicon can be taken as is, or be replaced by the users. We eventually settled on (automatically) transforming the information present in an already existing dictionary (the CNET lexicon) to serve as the lexical database. 6</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Normalizing Lexical Information
</SectionTitle>
      <Paragraph position="0"> In building a linguistic environment which is both French specific and usable by separate users with independently built systems, we knew that these would require lexical information to be presented in different ways. However, with the assumption that all of the lexical information necessary for the various syntactic analyses is actually present in the lexicon provided with EGL, we make the hypothesis that the content of this information is common to the various systems.</Paragraph>
      <Paragraph position="1"> Since an increasing number of grammatical formalisms put a large part of the linguistic description in the lexicon, we are interested in the nature and complexity of lexical entries, in the division of information between grammar and lexicon, in the representation of the syntactic information in the lexicon, as well as in the use of texical information in the grammar. Normalizing this information thus became an important part of the linguistic aspect of the project: the features in the pre-existing lexicon had to be transformed to serve as the basis for a &amp;quot;neutral&amp;quot; lexicon, Which must be usable by grammars not written in the same framework as that of the CNET.</Paragraph>
      <Paragraph position="2"> eThc CNET lexicon has more than 55000 entries defined with 200 keywords. The lexicon is transformed into minimal automata with quasi-linear time complexity for access. The compactness of the automata allows them to be resident in core memory.</Paragraph>
      <Paragraph position="3"> First, a correspondence was established between the syntactic and morpho-syntactic features of the CNET lexicon and the features required in systems created by members of the project: the GIREIL grammar; the LN-2-3 granlmar (INSI~.~RM); the ELU grammar (ISSCO). From the list of features used by each of them, we extracted those that pertain to the lexicon. We only considered attributes required by the grammars at the lexical level, thus discarding the features which represent information that cml only be evaluated during processing, i.e.</Paragraph>
      <Paragraph position="4"> which cannot be present in a lexical entry (e.g.</Paragraph>
      <Paragraph position="5"> VEUT-AUX-COMPOSE on a complex verbal form for LN-2-3, or REL on a nominal form in ELU). Since all three systems adopt to some extent a lexicallst approach mid include a large amount of syntactic information in the lexicon, this division required a detailed interpretation of their internal workings.</Paragraph>
      <Paragraph position="6"> Conversely, although morphological analysis is most often performed in a separate component (i.e. inflected forms do not constitute separate lexical entries), morphological information is included in our normalization, because that informarion must be present on the lexemes serving as starting points for the syntactic analysis.</Paragraph>
      <Paragraph position="7"> We then put in correspondence the lexical features of the various systems; here again, it was necessary to interpret the way they are actually used (e.g. in the representation of reflexive constructions). The normalization of the morpho-syntactic features required in these three grammars can now be extended to other grammatical analyses through the more general list of features established for the mapping which allows each system to recover in the lexicon the information it needs to perform an analysis.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> While French has been the object of relatively extensive research in computational linguistics, no extensive formal description of that language has been integrated in a linguistically motivated development environment. The EGL project is part of a growing trend towards a wider linguistic coverage coupled with greater flexibility.</Paragraph>
    <Paragraph position="1"> Designing a linguistic development environment requires making sonic fundamental choices about the grartmlatical forlnalism, and the evaluation of competing formalisms depends on assumptions inlposed by the task at hand (corn-ACTE~ DE COLING-92. NANT~, 23-28 ^O~' 1992 1 1 8 0 PROC. OV COLING-92, NANTES, AUO. 23-28, 1992 plexity, deternfiulsm, performance degradation in case of unforeseen input, use and integration of semantic information). The use of NL as a medimn for communication between loan and nmchine renders desirable the adaptability of an NLP system to various linguistic forlnalisms.</Paragraph>
    <Paragraph position="2"> However, if automatic information processing projects now more often include an NL component, that component is generally &amp;quot;closed&amp;quot; a~td unmodifiable: few systems are designed to provide the syntactic analysis of natural language texts or to be usable in various contexts. 7 In EGL, several of the modules nmy be reused outside of the grammatical formafisln chosen for our own linguistic description. This basic reqrfirement of system design can have important consequences when we want to tailor the system to applications where the linguistic domain is limited, which is the case in most natural laalguage interface applications. As a design tool, EGL makes it possible to see simultaneously arul to manipulate easily each of its components.</Paragraph>
  </Section>
class="xml-element"></Paper>