<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2187"> <Title>Software Reuse, Object-Orientated Frameworks</Title> <Section position="3" start_page="0" end_page="1057" type="metho"> <SectionTitle> 1 Resource Reuse and Natural Language Engineering </SectionTitle> <Paragraph position="0"> Car designers don't reinvent the wheel each time they plan a new model, but software engineers often find themselves repetitively producing roughly the same piece of software in slightly different form. The reasons for this inefficiency have been extensively studied, and a number of solutions are now available (Prieto-Diaz and Freeman, 1987; Prieto-Diaz, 1993). Similarly, the Natural Language Engineering (NLE¹) community has identified the potential benefits of reducing repetition, and work has been funded to promote reuse. This work concerns either reusable resources which are primarily data or those which are primarily algorithmic (i.e. processing 'tools', or programs, or code libraries).</Paragraph> <Paragraph position="1"> Successful examples of reuse of data resources include: the WordNet thesaurus (Miller et al., 1993); the Penn Tree Bank (Marcus et al., 1993); the Longmans Dictionary of Contemporary English (Summers, 1995). A large number of papers report results relative to these and other resources, and these successes have spawned a number of projects with similar directions, one of the latest examples being ELRA, the European Language Resources Association. ¹ See (Boguraev et al., 1995) or (Cunningham et al., 1995) for discussion of the significance of this label.</Paragraph> <Paragraph position="2"> The reuse of algorithmic resources remains more limited (Cunningham et al., 1994). There are a number of reasons for this, including: 1. cultural resistance to reuse, e.g. mistrust of 'foreign' code; 2. 
integration overheads.</Paragraph> <Paragraph position="3"> In some respects these problems are insoluble without general changes in the way NLE research is done: researchers will always be reluctant to use poorly-documented or unreliable systems as part of their work, for example. In other respects, solutions are possible. They include: 1. increasing the granularity of the units of reuse, i.e. providing sets of small building blocks instead of large, monolithic systems; 2. increasing the confidence of researchers in available algorithmic resources by increasing their reuse and the amount of testing and evaluation they are subjected to; 3. separating out the integration problems that are independent of the type of information being processed and reducing the overhead caused by these problems by providing a software architecture for NLE systems.</Paragraph> <Paragraph position="4"> Our view is that successful algorithmic reuse in NLE will require the provision of support software for NLE in the form of a general architecture and development environment which is specifically designed for text processing systems. Under EPSRC² grant GR/K25267 the NLP group at the University of Sheffield are developing a system that aims to implement this new approach. The system is called GATE - the General Architecture for Text Engineering.</Paragraph> <Paragraph position="5"> ² The Engineering and Physical Sciences Research Council, a UK funding body.</Paragraph> <Paragraph position="6"> GATE is an architecture in the sense that it provides a common infrastructure for building language engineering (LE) systems. It is also a development environment that provides aids for the construction, testing and evaluation of LE systems (and particularly for the reuse of existing components in new systems). Section 2 describes GATE. A substantial amount of work has already been done on architecture for NLE systems (and GATE reuses this work wherever possible). 
Three existing systems are of particular note:</Paragraph> </Section> <Section position="4" start_page="1057" end_page="1057" type="metho"> <SectionTitle> * ALEP (Simpkins and Groenendijk, 1994), which </SectionTitle> <Paragraph position="0"> turns out to be a rather different enterprise from ours; * MULTEXT (Thompson, 1995), a different but largely complementary approach to some of the problems addressed by GATE, particularly strong on SGML support; * TIPSTER (ARPA, 1993a), whose architecture (TIPSTER, 1994; Grishman, 1994) has been adopted as the storage substructure of GATE, and which has been a primary influence on the design and implementation of the system.</Paragraph> <Paragraph position="1"> See (Cunningham et al., 1995) for details of the relation between GATE and these projects.</Paragraph> </Section> <Section position="5" start_page="1057" end_page="1059" type="metho"> <SectionTitle> 2 GATE </SectionTitle> <Paragraph position="0"> Architecture overview. GATE presents LE researchers or developers with an environment where they can use tools and linguistic databases easily and in combination, launch different processes, say taggers or parsers, on the same text and compare the results, or, conversely, run the same module on different text collections and analyse the differences, all in a user-friendly interface. Alternatively, module sets can be assembled to make e.g. IE, IR or MT systems.</Paragraph> <Paragraph position="1"> Modules and systems can be evaluated (using e.g. the Parseval tools), reconfigured and reevaluated - a kind of edit/compile/test cycle for LE components.</Paragraph> <Paragraph position="2"> 
GATE comprises three principal elements: * a database for storing information about texts and a database schema based on an object-oriented model of information about texts (the GATE Document Manager - GDM); * a graphical interface for launching processing tools on data and viewing and evaluating the results (the GATE Graphical Interface - GGI); * a collection of wrappers for algorithmic and data resources that interoperate with the database and interface and constitute a Collection of REusable Objects for Language Engineering - CREOLE.</Paragraph> <Paragraph position="3"> GDM is based on the TIPSTER document manager. We are planning to enhance the SGML capabilities of this model by exploiting the results of the MULTEXT project.</Paragraph> <Paragraph position="4"> GDM provides a central repository or server that stores all the information an LE system generates about the texts it processes. All communication between the components of an LE system goes through GDM, insulating parts from each other and providing a uniform API (applications programmer interface) for manipulating the data produced by the system.³ Benefits of this approach include the ability to exploit the maturity and efficiency of database technology, easy modelling of blackboard-type distributed control regimes (of the type proposed by (Boitet and Seligman, 1994) and in the section on control in (Black ed., 1991)) and reduced interdependence of components.</Paragraph> <Paragraph position="5"> GGI is in development at Sheffield. It is a graphical launchpad for LE subsystems, and provides various facilities for viewing and testing results and playing software lego with LE components - interactively assembling objects into different system configurations.</Paragraph> <Paragraph position="6"> All the real work of analysing texts (and maybe producing summaries of them, or translations, or SQL statements, etc.) 
in a GATE-based LE system is done by CREOLE modules.</Paragraph> <Paragraph position="7"> Note that we use the terms module and object rather loosely to mean interfaces to resources which may be predominantly algorithmic or predominantly data, or a mixture of both. We exploit object-orientation for reasons of modularity, coupling and cohesion, fluency of modelling and ease of reuse (see e.g. (Booch, 1994)).</Paragraph> <Paragraph position="8"> Typically, a CREOLE object will be a wrapper around a pre-existing LE module or database - a tagger or parser, a lexicon or n-gram index, for example. Alternatively objects may be developed from scratch for the architecture - in either case the object provides a standardised API to the underlying resources which allows access via GGI and I/O via GDM. The CREOLE APIs may also be used for programming new objects.</Paragraph> <Paragraph position="9"> The initial release of GATE will be delivered with a CREOLE set comprising a complete MUC-compatible IE system (ARPA, 1996). Some of the objects will be based on freely available software (e.g. the Brill tagger (Brill, 1994)), while others are derived from Sheffield's MUC-6 entrant, LaSIE⁴ (Gaizauskas et al., 1996). This set is called VIE - a Vanilla IE system. CREOLE should expand quite rapidly during 1996-7, to cover a wide range of LE R&amp;D components, but for the rest of this section we will use IE as an example of the intended operation of GATE.</Paragraph> <Paragraph position="10"> The recent MUC competition, the 6th, defined four IE tasks to be carried out on Wall Street Journal articles. Developing the MUC system upon which VIE is based took approximately 24 person-months, one significant element of which was coping with the strict MUC output specifications. 
What does a research group do which either does not have the resources to build such a large system, or even if it did would not want to spend effort on areas of language processing outside of its particular specialism? 
The answer until now has been that these groups cannot take part in large-scale system building, thus missing out on the chance to test their technology in an application-oriented environment and, perhaps more seriously, missing out on the extensive quantitative evaluation mechanisms developed in areas such as MUC. In GATE and VIE we hope to provide an environment where groups can mix and match elements of our MUC technology with components of their own, thus allowing the benefits of large-scale systems without the overheads. A parser developer, for example, can replace the parser supplied with VIE.</Paragraph> <Paragraph position="11"> Licensing restrictions preclude the distribution of MUC scoring tools with GATE, but Sheffield may arrange for evaluation of data produced by other sites. In this way, GATE/VIE will support comparative evaluation of LE components at a lower cost than the ARPA programmes (ARPA, 1993a) (partly by exploiting their work, of course!). Because of the relative informality of these evaluation arrangements, and as the range of evaluation facilities in GATE expands beyond the four IE tasks of the current MUC, we should also be able to offset the tendency of evaluation programmes to dampen innovation. By increasing the set of widely-used and evaluated NLP components GATE aims to increase the confidence of LE researchers in algorithmic reuse.</Paragraph> <Paragraph position="12"> Working with GATE/VIE, the researcher will from the outset reuse existing components, the overhead for doing so being much lower than is conventionally the case - instead of learning new tricks for each module reused, the common APIs of GDM and CREOLE mean only one integration mechanism must be learnt. ⁴ Large-Scale IE.</Paragraph> <Paragraph position="13"> And as CREOLE expands, more and more modules and databases will be available at low cost. 
We hope to move towards sub-component level reuse at some future point, possibly providing C++ libraries as part of an OO LE framework (Cunningham et al., 1994).</Paragraph> <Paragraph position="14"> This addresses the need for increased granularity of the units of reuse as noted in section 1.</Paragraph> <Paragraph position="15"> As we built our MUC system it was often the case that we were unsure of the implications for system performance of using tagger X instead of tagger Y, or gazetteer A instead of pattern matcher B. In GATE, substitution of components is a point-and-click operation in the GGI interface. (Note that delivered systems, e.g. EC project demonstrators, can use GDM and CREOLE without GGI - see below.) This facility supports hybrid systems, ease of upgrading and open systems-style module interchangeability.</Paragraph> <Paragraph position="16"> Of course, GATE does not solve all the problems involved in plugging diverse LE modules together. There are two barriers to such integration: * incompatibility of representation of information about text and the mechanisms for storage, retrieval and inter-module communication of that information; * incompatibility of type of information used and produced by different modules.</Paragraph> <Paragraph position="17"> GATE enforces a separation between these two and provides a solution to the former (based on the work of the TIPSTER architecture group (TIPSTER, 1994)). This solution is to adopt a common model for expressing information about text, and a common storage mechanism for managing that information, thereby cutting out significant parts of the integration overheads that often block algorithmic reuse. 
Because GATE places no constraints on the linguistic formalisms or information content used by CREOLE objects (or, for that matter, the programming language they are implemented in), the latter problem must be solved by dedicated translation functions - e.g. tagset-to-tagset mapping - and, in some cases, by extra processing - e.g. adding a semantic processor to complement a bracketing parser in order to produce logical form to drive a discourse interpreter.</Paragraph> <Paragraph position="18"> As more of this work is done we can expect the overhead involved to fall, as all results will be available as CREOLE objects. In the early stages Sheffield will provide some resources for this work in order to get the ball rolling, i.e. we will provide help with CREOLEising existing systems and with developing interface routines where practical and necessary. We are confident that integration is possible (partly because we believe that differences between representation formalisms tend to be exaggerated) - and others share this view, e.g. the MICROKOSMOS project (Onyshkevych et al., 1994).</Paragraph> <Paragraph position="19"> GATE is also intended to benefit the LE system developer (which may be the LE researcher with a different hat on, or industrialists implementing systems for sale or for their own text processing needs). A delivered system comprises a set of CREOLE objects, the GATE runtime engine (GDM and associated APIs) and a custom-built interface (maybe just character streams, maybe a Visual Basic Windows GUI, ... ). The interface might reuse code from GGI, or might be developed from scratch. The LE user has the possibility to upgrade by swapping parts of the CREOLE set if better technology becomes available elsewhere.</Paragraph> <Paragraph position="20"> GATE cannot eliminate the overheads involved with porting LE systems to different domains (e.g. from financial news to medical reports). 
Tuning LE system resources to new domains is a current research issue (see also the LRE DELIS and ECRAN projects). The modularity of GATE-based systems should, however, contribute to cutting the engineering overhead involved.</Paragraph> </Section> </Paper>