<?xml version="1.0" standalone="yes"?> <Paper uid="C92-2123"> <Title>TOWARDS COMPUTER-AIDED LINGUISTIC ENGINEERING</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 827 </SectionTitle> <Paragraph position="0"> The generation of specific analysis or generation programs can be automated insofar as the target programming languages are constraint logic programming languages whose data structures are typed feature structures.</Paragraph> <Paragraph position="1"> The various elements constituting this approach are currently at varying stages of development. However, this approach is already partially in use by several groups in a number of national and European projects, in particular in the domain of electronic dictionaries.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> TOWARDS COMPUTER-AIDED LINGUISTIC ENGINEERING RÉMI ZAJAC GRIL </SectionTitle> <Paragraph position="0"> We outline a framework for computer-aided linguistic engineering based on the automatic generation of NLP programs from specifications, and on the automated construction of reusable linguistic specifications. The specification language is based on Typed Feature Structures, and the target programming language is a constraint logic programming language whose data structures are typed feature structures. 
Reusability of linguistic specifications is enhanced by organizing the specifications in an object-oriented style in several layers of increasing specificity, supporting for example the incremental development of grammar specifications for sublanguages.</Paragraph> <Paragraph position="1"> 1 A framework for NLP Software Engineering The development of reliable high-quality linguistic software is a time-consuming, error-prone, and costly process. A parser used in an industrial NLP system is typically developed by one person over several years. The development of a linguistic engineering methodology is one of the major issues in the development of a language industry. The process of developing an NLP application is an application and an adaptation of the classical software engineering development methodology and follows three major steps: the initial requirements and specifications expressed in natural language, the formal specification of the system, and finally the implementation of the system \[Biggerstaff/Perlis 89\].</Paragraph> <Paragraph position="2"> The requirements specific to a linguistic engineering methodology are: 1. The initial requirements are complemented by a corpus giving typical examples of the texts, and of the linguistic phenomena contained in these texts, to be treated by the system; 2. The set of formal specifications constitutes a standardized repository of formalized linguistic knowledge that is reusable across different NLP applications - a crucial property given the sheer size of grammars and dictionaries - and executable, so that the specifications can be tested against corpora;</Paragraph> <Paragraph position="3"> 3. NLP programs are generated (semi-)automatically from formal specifications.</Paragraph> <Paragraph position="4"> These particularities have the following implications: 1. 
The availability of a corpus makes it possible to develop a methodology based on sublanguages and corpus analysis, automating the knowledge acquisition process.</Paragraph> <Paragraph position="5"> 2. The linguistic specification does not include any information specific to a particular application (especially, it does not contain any control information), so the same specification can be reused for different applications (genericity).</Paragraph> <Paragraph position="6"> A specification language for describing linguistic knowledge can be based on a feature logic with an object-oriented inheritance style that makes it possible to distinguish formally between generic knowledge and specific (e.g., sublanguage) knowledge, thus enabling the reuse of specifications in the development of the specifications themselves.</Paragraph> <Paragraph position="7"> The expressive power of the specification language (a non-decidable subset of first-order logic) makes it possible to remove the conventional distinction between dictionaries and grammars, providing a single homogeneous framework for an integrated development of linguistic knowledge bases.</Paragraph> <Paragraph position="8"> The use of a feature-based language also favors standardization, as feature structures become a &quot;lingua franca&quot; for computational linguists. Several modern specialized linguistic programming languages can be the targets of the automated generation process. 
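As a loose modern illustration of feature structures as a &quot;lingua franca&quot; (a Python sketch, not the paper's formalism; the dict encoding and the sample structures are invented, and reentrancy/structure sharing is omitted):

```python
# Minimal sketch of feature-structure unification: feature structures
# are nested dicts, atomic values are plain Python values.

def unify(fs1, fs2):
    """Return the most general structure subsuming both inputs, or None."""
    if fs1 == fs2:
        return fs1
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for feat, val in fs2.items():
            if feat in result:
                merged = unify(result[feat], val)
                if merged is None:
                    return None  # feature clash: unification fails
                result[feat] = merged
            else:
                result[feat] = val  # feature only in fs2: just add it
        return result
    return None  # clash between incompatible atomic values

# Invented sample structures for the sketch:
noun_phrase = {"cat": "NP", "agr": {"num": "sg"}}
subject     = {"agr": {"num": "sg", "pers": "3"}}
print(unify(noun_phrase, subject))
# {'cat': 'NP', 'agr': {'num': 'sg', 'pers': '3'}}
```

A failed unification (a feature clash) models the rejection of an inconsistent analysis.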
Since the specification language is based on typed feature structures, natural candidates are unification-based grammar formalisms.</Paragraph> <Paragraph position="9"> A linguistic engineering methodology should also address the following points: * strict separation between pure linguistic knowledge and knowledge about strategies for its use in a particular application, a condition sine qua non for reusability; * concepts of modularity for linguistic description, e.g., formal separation of knowledge pertaining to different levels of linguistic description, organization of linguistic knowledge in hierarchies (from generic to specific); * team organization of linguistic development projects.</Paragraph> <Paragraph position="10"> 1 Reusable linguistic descriptions In software engineering, the use of the term &quot;reusability&quot; covers two main trends: the composition-based approach and the generation-based approach. In the first approach, software components can be plugged together with no or small modifications in order to build software systems: programming languages such as ADA or object-oriented languages are designed to support this type of reuse. This approach is successful when the components are small and perform very precise functions, as for numerical analysis \[Biggerstaff/Perlis 89\]. In NLP, this approach is exemplified by the reuse of various &quot;engines&quot; such as parsers.</Paragraph> <Paragraph position="11"> In the second approach, software components are generated (semi-automatically) from a set of formal specifications, instantiating these specifications in a programming language by choosing appropriate data representations and control structures: the knowledge expressed in the specification is reused in various contexts to generate different applications. 
This approach is successful when a fair amount of domain knowledge is built into the specification and the generation environment, e.g., business knowledge in</Paragraph> <Paragraph position="13"> This is the approach we envisage for producing NLP programs.</Paragraph> <Paragraph position="14"> To support reusability and incremental development of specifications, we organize and describe linguistic knowledge using partial specifications and controlled degrees of abstraction in the overall design. This approach should of course be supported by a specification language that is based on the concept of partial information and provides the means of structuring a specification into a hierarchy of subspecifications of increasing specificity.</Paragraph> <Paragraph position="15"> We envisage three basic levels of abstraction. The initial design of the linguistic domain is rather abstract and largely free of details. It establishes the basic building blocks, the basic structures and the foundations of the linguistic domain. At that level, we could aim at providing a consensual formal definition of these basic building blocks as a first step towards the definition of standards for representing linguistic knowledge. For example, the initial level of abstraction could start from basic descriptive classifications, e.g. at the categorial level nouns, verbs, etc., and from the basic syntactic dependencies between these categories, and give them a formal definition.</Paragraph> <Paragraph position="16"> A second level of specialization makes choices as to the distribution of linguistic properties into more fine-grained categories. At that level, we observe the emergence of linguistic theories, where choices are triggered by theoretical assumptions. 
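The levels of abstraction described above can be loosely pictured as an inheritance hierarchy. The following Python sketch is purely illustrative; the class and feature names (Sign, CountNoun, the weather sublanguage) are invented, not taken from the paper:

```python
# Three abstraction levels as an inheritance chain: generic building
# blocks, theory-driven refinements, and a sublanguage-specific instance
# that reuses all upper-level decisions unchanged.

class Sign:                      # level 1: generic building block
    features = {"cat": None}

class Noun(Sign):                # level 1: basic categorial classification
    features = {**Sign.features, "cat": "noun"}

class CountNoun(Noun):           # level 2: theory-driven refinement
    features = {**Noun.features, "countable": True}

class WeatherNoun(CountNoun):    # level 3: sublanguage (weather reports)
    features = {**CountNoun.features, "domain": "weather"}

# The sublanguage class inherits all generic design decisions:
print(WeatherNoun.features)
# {'cat': 'noun', 'countable': True, 'domain': 'weather'}
```

A second sublanguage would add a sibling of `WeatherNoun`, reusing levels 1 and 2 without redoing them.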
Given the relative freedom of structuring, the choice between competing representations should be guided by the concern for modularity and reusability (internal constraints) and by the external constraints on the coverage and the adequacy of the linguistic representation to the needs of NLP applications. Linguistic specifications should be developed as a set of independently defined modules with well-defined interconnections: modularity is essential in supporting reusability and team work in the development of specifications.</Paragraph> <Paragraph position="17"> At the third level of specialization, the linguistic organization principles are instantiated in the fully detailed description of specific linguistic phenomena. This level is sufficiently detailed to test the specification against actual sentences (strings of word forms). Previous levels can also be tested, but only against abstract descriptions representing sets of sentences. This is also the level at which we have several different instances corresponding to different sublanguages, each sublanguage description reusing the same first and second levels of specification, freeing the linguist from redoing the same design decisions for each instance. There could also be a structuring among sublanguages which could introduce finer levels of abstraction, thus achieving a higher degree of reusability.</Paragraph> <Paragraph position="18"> This overall framework, in which each level sets partial constraints on the most specific instances, is able to support the incremental development of linguistic knowledge by successive refinements and thus further reusability.</Paragraph> <Paragraph position="19"> 
</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A linguistic description language </SectionTitle> <Paragraph position="0"> The crucial issue in the generation-based approach to reusability is the nature and the definition of the specification language. A specification language has to be defined and implemented as pure logic to fully support reusability. It should be suitable for describing the knowledge of a particular domain and should build on well-accepted notions and notations for that domain: here, natural language processing. In NLP, the emergence of unification-based grammar formalisms promoted the use of feature structures as a &quot;lingua franca&quot; for representing linguistic information.</Paragraph> <Paragraph position="1"> Although some work on unification-based grammar formalisms is motivated by reusability of linguistic specifications (e.g., &quot;reversible grammars&quot;), such work usually does not address the problem of specifications in engineering terms. Furthermore, these formalisms make strong assumptions about the nature of linguistic representation (footnote 1), thereby severely limiting the expressive power of these languages.</Paragraph> <Paragraph position="2"> The linguistic specification language is based on a typed version of a logic for feature structures which makes it possible to define specifications at different levels of abstraction. Using this language, it will be possible to eliminate the conventional division between lexical and grammatical knowledge, and also the division between generic and specific (e.g., sublanguage) knowledge.</Paragraph> <Paragraph position="3"> Such a specification language is executable (although potentially infinitely inefficient), and it should be executable for two reasons. First, since the formal specification is the first level of formality in the conception of a software system, correctness cannot be proved by formal means. 
However, an executable specification language at least makes it possible to test the specifications against examples. Second, it should be possible to derive an actual program (e.g., a parser) from a specification. An executable specification language ensures the basic feasibility of an automated generation of NLP programs.</Paragraph> <Paragraph position="4"> The specification language is formally based on a subset of first-order logic. In order to make it manageable and intuitive, it employs syntactic constructs called Typed Feature Structures (TFSs). The &quot;vocabulary&quot; of the language, its signature, consists of unary predicates (sorts) and binary predicates (features). Moreover, there is an ordering on the sorts (yielding a lower semi-lattice). The structures over which the language is interpreted are determined in that they have to satisfy certain axioms: the features give partial functions, and the ordering on the sorts is reflected as subset inclusion (unary predicates give sets). They are not fully specific, however, which reflects the situation in knowledge representation where the domain of discourse is not completely specified. By adding new axioms, this domain is made more and more specific; in the extreme case, one structure is singled out. (Footnote 1: assumptions which are sometimes only motivated by processing considerations.)</Paragraph> <Paragraph position="5"> The sort signature is extendable through (recursive) definitions of new sorts; these are given by defining explicit constraints which come from the language itself (the TFS constraint language). The sorts are organized into an inheritance hierarchy, with a clean (logical, algebraic and type-theoretic) semantics of inheritance in the object-oriented programming style. 
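The sort ordering (a lower semi-lattice) can be illustrated with an invented toy hierarchy; typed unification computes the greatest lower bound of two sorts, with failure playing the role of &quot;bottom&quot;. This is a sketch, not the TFS language itself:

```python
# Immediate subsort relation of an invented toy sort hierarchy.
SUBSORTS = {
    "sign":    {"nominal", "verbal"},
    "nominal": {"noun", "pronoun"},
    "verbal":  set(),
    "noun":    set(),
    "pronoun": set(),
}

def descendants(sort):
    """All sorts at or below `sort` in the hierarchy."""
    below = {sort}
    for sub in SUBSORTS.get(sort, ()):
        below |= descendants(sub)
    return below

def glb(s1, s2):
    """Greatest lower bound of two sorts, or None for 'bottom'."""
    common = descendants(s1) & descendants(s2)
    # the GLB is the member of `common` that subsumes all the others
    for c in common:
        if common <= descendants(c):
            return c
    return None

print(glb("sign", "nominal"))   # nominal
print(glb("noun", "verbal"))    # None: incompatible sorts
```

In a lower semi-lattice every pair of sorts has such a unique result, which is what makes typed unification well-defined.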
This subset of first-order logic can be extended with further logical connectives, such as negation and quantification.</Paragraph> <Paragraph position="6"> Given the signature, which defines the constraints available to the user, the user has the option of extending the language by specifying new predicates.</Paragraph> <Paragraph position="7"> These are interpreted as relations between the elements of the domain of the respective interpretation structure. The language is still a subset of first-order logic; thus, its syntax can be chosen like that of definite clauses, but with TFSs instead of first-order terms.</Paragraph> <Paragraph position="8"> The specification language thus obtained allows the user to create partial specifications that can be incrementally extended, and to express controlled degrees of abstraction and precision. Although of considerable expressive power, this specification language is executable; however, the control information is abstracted away: formally, the execution is nondeterministic, and there is no explicit programming feature to express control. This has a good reason: control information coded in programs is specific to particular applications. For grammars, for example, for the same underlying logical specification the control will be different in parsing and in generation, or even in different parsers (e.g., for indexing or for grammar checking). Thus, abstracting from control is important for gaining genericity: logical specifications apply to more problems than programs. 
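The control-abstraction point can be illustrated with a toy relation between a category and the word strings it covers, driven either for generation or for recognition. The rules and function names are invented for this sketch:

```python
# One declarative description, two execution regimes: parsing and
# generation differ only in how the rules are driven, not in the rules.

RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["john"], ["mary"]],
    "VP": [["sleeps"], ["dreams"]],
}

def derive(cat):
    """Enumerate all word strings derivable from `cat` (generation)."""
    if cat not in RULES:            # terminal word form
        yield [cat]
        return
    for rhs in RULES[cat]:
        yield from expand(rhs)

def expand(cats):
    """Enumerate derivations of a sequence of categories."""
    if not cats:
        yield []
        return
    for head in derive(cats[0]):
        for tail in expand(cats[1:]):
            yield head + tail

def parse(cat, words):
    """Recognition: drive the same rules in the opposite direction."""
    return words in derive(cat)

print(parse("S", ["mary", "sleeps"]))   # True
print(next(derive("S")))                # ['john', 'sleeps']
```

A real compiler would replace the blind enumeration in `parse` with an efficient directed strategy; the declarative rules would stay untouched.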
The knowledge specification language is used in a first step in the generation of correct programs.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Automating the acquisition of linguistic descriptions </SectionTitle> <Paragraph position="0"> We assume that the acquisition of linguistic information will build upon the definition of broad linguistic categories formalized as the initial and second levels of linguistic abstraction described above. In a Computer-Aided Linguistic Engineering framework, the acquisition of linguistic information is targeted towards the needs of specific applications: we also assume that the linguist uses for testing purposes a set of examples of the kind of text he describes (test case). These examples (the &quot;corpus&quot;) can be constructed (as a way, for example, to specify the kind of dialogue envisaged for a natural language man-machine interface) or can come from existing texts, for example existing technical documentation. The acquisition of linguistic information consists in describing in full detail the set of linguistic phenomena occurring in the corpus as a specialization of linguistic axioms and principles. The acquisition is performed in two steps. First, the linguist uses corpus analysis tools to characterize the particularities of the sublanguage phenomena occurring in the corpus and to define the coverage (set of linguistic categories) that should be reached. Then, the linguist describes formally (i.e., using the specification language) in all detail the phenomena occurring in the corpus, using corpus analysis tools to find examples and to refine the categorization \[Ananiadou 90, Tsujii et al. 
90\].</Paragraph> <Paragraph position="1"> This approach to the acquisition of linguistic knowledge leads to the definition of a precise methodology (basic concepts and working procedures) supported by a specific set of software tools: * Concepts. The basic concepts underlying this methodology are the notions of sublanguage and coverage \[Grishman/Kittredge 86, Kittredge/Lehrberger 82, Grishman/Hirschman/Ngo 86\].</Paragraph> <Paragraph position="2"> Given a corpus, a linguist should be able to give a high-level description of it in terms of its linguistic particularities which are not found in other kinds of texts, and in terms of the set of linguistic phenomena which occur in it: these concepts should be defined operationally to allow the linguist to apply them to actual texts.</Paragraph> <Paragraph position="3"> * Working procedure. A working procedure defines the steps to be taken in the acquisition of linguistic knowledge, both in larger steps (characterization of the corpus, then acquisition) and in details such as how to document the phenomena described, to link a formal description to examples of the corpus, to check the consistency of the description with other parts of the specification, etc. It also gives examples of, e.g., how to define new lexical semantic classes using a cluster analysis tool (see below).</Paragraph> <Paragraph position="4"> * Software tools. The concepts and working procedures are supported by a set of specialized linguistic software tools integrated in a Computer-Aided Linguistic Engineering workstation.</Paragraph> <Paragraph position="5"> These software tools supporting the acquisition of linguistic knowledge should have the following functionalities: * Tagging. A first set of functionalities is to tag a corpus using linguistic markers such as the category of word forms, their inflection, etc. 
Several levels of sophistication will be distinguished depending on the availability of the appropriate set of parameters: sets of closed categories, sets of word forms, sets of morphemes, definition of phrase boundaries, etc.</Paragraph> <Paragraph position="6"> * Text DBMS. A tagged corpus is loaded into a text DBMS for further exploitation, and accessed through a specialized linguistic interface (using a specialized query language).</Paragraph> <Paragraph position="7"> * Statistics and cluster analysis. Two kinds of information can be extracted from a tagged corpus: statistical information, and concordance and clustering information. Statistical and clustering analysis algorithms will be implemented and incorporated as functionalities of the linguistic interface of the text database.</Paragraph> <Paragraph position="8"> * Semantic editor. The essential operation in linguistic acquisition is the creation of specializations of existing categories. A semantic editor takes into account the definition of existing classes and interactively guides the user in the creation of instances.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Automating the generation of NLP programs </SectionTitle> <Paragraph position="0"> In the development process sketched above (Section 1) the last step is the implementation of the system. Automatic generation of NLP software has been focused on the (crucial) domain of lexical resources (how to build generic resources and compilers that can extract electronic dictionaries from a lexical knowledge base for NLP systems) and on the domain of &quot;reversible grammars&quot; 1. (Footnote 1: See for example the Proceedings of the ACL Workshop on Reversible Grammars, Berkeley, June 1991.)</Paragraph> <Paragraph position="1"> The process of transforming a specification into an efficient program is very similar to compilation. If the structure of a set of specifications is stable, a compiler can be built to generate a program. This is the approach envisaged for lexical information 2.</Paragraph> <Paragraph position="2"> Lexical information is here considered as &quot;static&quot; information: once the structure of the lexicon is defined, adding or removing an entry will not modify the compilation process. This is less true for grammatical information, which defines how the basic linguistic building blocks, i.e., lexical entries, are combined into larger structures. Here, the needs may vary depending on the processing requirements of different NLP applications. For example, a grammar checker and an indexing system will most probably not use the same parsing scheme: they will treat errors and ambiguities differently. Thus, a general approach is needed.</Paragraph> <Paragraph position="3"> Since the knowledge specification language is executable, this means that, to generate a program, there are two basic choices to be made: the selection of data structures and the selection of control structures. The nature and the complexity of these choices depend on the distance between the specification language and the targeted programming language.</Paragraph> <Paragraph position="4"> As a programming language into which the specifications are derived, we envisage using the Constraint Logic Programming (CLP) language LIFE developed at DEC-PRL \[Aït-Kaci/Meyer 90, Aït-Kaci/Podelski 91\]. The reason is that its formal foundation has parts in common with the Knowledge Specification Language; in particular, its basic data structures are also Typed Feature Structures, thus ensuring a basic level of compatibility between the two. Another reason is its descriptive power, its efficiency and its flexibility in execution (&quot;data-driven&quot;): LIFE subsumes the two main programming paradigms (logic programming, as in PROLOG, and functional programming, as in LISP or ML). That is, a &quot;logic&quot; 
(or &quot;functional&quot;) programmer may stick to his favorite programming style and still write code in LIFE.</Paragraph> <Paragraph position="5"> Since the data model is the same, to generate an efficient program from a specification, the user will only have to select appropriate control structures. For example, to generate dictionaries for a parsing program, the only refinement the user will have to develop is an efficient indexing mechanism that allows a parser direct access to a lexical entry. In generating NLP parsers or NLP generators, the user will have to choose between a functional control structure (as in ML) and a relational control structure,</Paragraph> <Paragraph position="6"> as in PROLOG. For the latter, additional choices have to be made, such as the ordering of clauses, the introduction of cuts, etc. \[Deville 90\]. Research in computational linguistics has identified a few central</Paragraph> <Paragraph position="7"> computational concepts appropriate for NLP, among them regular grammars and regular transducers, augmented context-free grammars and tree transducers. (Footnote 2: This is also the approach envisaged in the ESPRIT project Multilex and in the Eurotra-7 study.)</Paragraph> <Paragraph position="8"> In particular, augmented context-free grammars are the framework of the research in so-called &quot;reversible grammars&quot;. This research can be used in the development of NLP processing schemes defined as annotations to the specification \[Deville 90, Uszkoreit 91\].</Paragraph> <Paragraph position="9"> Assuming that a set of specifications is stable, it is possible to write a specialized compiler to generate a LIFE program for, e.g., parsing or generation. This compiler will embed the control choices that a designer of a parser makes when developing a parsing algorithm. 
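The dictionary-indexing refinement mentioned above might look like this in a loose modern sketch (the entries and field names are invented; only the compilation-versus-lookup split reflects the text):

```python
# The declarative lexicon is a flat list of entries; the only
# generation-time "control" choice is an index over the surface form,
# giving the parser direct access to an entry instead of a linear scan.

LEXICON = [
    {"form": "sleeps", "cat": "verb", "agr": "3sg"},
    {"form": "john",   "cat": "noun", "agr": "3sg"},
    {"form": "dreams", "cat": "verb", "agr": "3sg"},
]

# Compilation step: build a hash index keyed on the `form` feature.
INDEX = {}
for entry in LEXICON:
    INDEX.setdefault(entry["form"], []).append(entry)

def lookup(form):
    """Direct access for the parser: O(1) instead of scanning LEXICON."""
    return INDEX.get(form, [])

print(lookup("sleeps"))
# [{'form': 'sleeps', 'cat': 'verb', 'agr': '3sg'}]
```

Adding or removing an entry leaves the compilation step unchanged, which is exactly the &quot;static&quot; character of lexical information noted above.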
This kind of generation has been shown to be practically feasible for lexical information, and research on &quot;reversible grammars&quot; has demonstrated the feasibility for grammatical information as well (see for example \[Dymetman/Isabelle 88\], who present a prototype of a machine translation system capable of translating in both directions using the same grammars and dictionaries).</Paragraph> <Paragraph position="10"> However, we also have a more ambitious long-term goal, which is to develop methods and tools for fully automating the generation of a program. Using these tools, the user will interactively guide the system in the generation of a program, experimenting with various choices and recording the design decisions for control, to be used in a fully automatic step once the design is completed \[Biggerstaff/Perlis 89\].</Paragraph> </Section> </Paper>