<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1208"> <Title>Implementing a Sense Tagger in a General Architecture for Text Engineering</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 GATE - the concept </SectionTitle> <Paragraph position="0"> GATE is an architecture and development environment for research and development workers in NLP and Language Engineering1. It is an architecture in the sense that it specifies a macro-level organisational pattern for the various components and data resources that make up a language processing (actually at present only text processing) system (Shaw and Garlan, 1996). It is also a development environment that adds a rich set of graphical tools to the architecture, enabling the developer to easily integrate new processing components, to manage flow of control between components, to visualise the data produced by and passed between components, and to evaluate the contribution of components to some externally defined and measured language processing task.</Paragraph> <Paragraph position="1"> As we've noted elsewhere (Cunningham, Gaizauskas, and Wilks, 1995; Cunningham, Wilks, and Gaizauskas, 1996b), the motivating factors behind development of the architecture included the facilitation of reuse of components (which has previously been successful in NLP only for data resources (Cunningham, Freeman, and Black, 1994; Cunningham, 1994)), comparative and task-based evaluation, collaborative research, and software-level robustness, efficiency and portability.</Paragraph> <Paragraph position="2"> 1The application of NLP and CL theory to the creation of practical applications software has recently become known as Language Engineering, or LE, or NLE, and has been defined in various ways in e.g. (Mitkov, 1996; Thompson, 1985; Boguraev, Garigliano, and Tait, 1995; Gazdar, 1996). Our gloss on these various definitions is that Language Engineering is the discipline or act of engineering software systems that perform tasks involving processing human language. Both the construction process and its outputs are measurable and predictable. The literature of the field relates both to the application of relevant scientific results and to a body of practice.</Paragraph> <Paragraph position="3"> The design we arrived at in support of these aims is sketched in the rest of this section.</Paragraph> <Paragraph position="4"> NLP systems produce information about texts (which may sometimes be the results of automatic speech recognition). Existing systems that aim to provide software infrastructure for NLP can be classified as belonging to one of three types according to the way in which they treat this information:
* additive, or markup-based: information produced is added to the text in the form of markup, e.g. in SGML (Thompson and McKelvie, 1996);
* referential, or annotation-based: information is stored separately with references back to the original text, e.g.
in the TIPSTER architecture (Grishman, 1996);
* abstraction-based: the original text is preserved in processing only as parts of an integrated data structure that represents information about the text in a uniform theoretically-motivated model, e.g. attribute-value structures in the ALEP system (Simkins, 1994).</Paragraph> <Paragraph position="5"> A fourth category might be added to cater for those systems that provide communication and control infrastructure without addressing the text-specific needs of NLP (e.g. Verbmobil's ICE architecture (Amtrup, 1995)).</Paragraph> <Paragraph position="6"> As noted at a previous conference in this series (Cunningham, Wilks, and Gaizauskas, 1996b), we believe that performance and other considerations favour the referential approach, but also that SGML is a key part of any general text processing strategy. The first design decision we made, then, was to base GATE on a referential core using the TIPSTER architecture, and to cater for SGML via I/O format conversion filters. This led to the development of one of three key pillars of the system: GDM, the GATE Document Manager. GDM and the TIPSTER API that it implements form a buffer between processing modules in a GATE-based NLP system. Modules no longer talk to each other, with the coherence and coupling implications that direct unrestricted communication can imply, but to GDM via the TIPSTER API.</Paragraph> <Paragraph position="7"> One of the key benefits of adopting an explicit architecture for data management is that it becomes possible to easily add a layer of graphical interface access to architectural services and data visualisation tools, and such a layer is our second pillar: GGI, the GATE graphical interface. GGI has functions for creating, viewing and editing the collections of documents which are managed by the GDM and that form the corpora which LE modules and systems in GATE use as input data. It also has facilities to display the results of module or system execution: new or changed annotations associated with the document. These annotations can be viewed either in raw form, using a generic annotation viewer, or in an annotation-specific way, if special annotation viewers are available. For example, named entity annotations, which identify and classify proper names (e.g. organization names, person names, location names), are shown by colour-coded highlighting of relevant words; phrase structure annotations are shown by graphical presentation of parse trees. Note that the viewers are general for particular types of annotation, so, for example, the same procedure is used for any POS tag set, named-entity markup etc. Thus developers reuse GATE data visualisation code with negligible overhead.</Paragraph> <Paragraph position="8"> Lastly, the third pillar of the system is the one that does all the real work of processing texts and discovering information about their content: CREOLE, a Collection of REusable Objects for Language Engineering. In a sense CREOLE isn't part of GATE at all: it is the set of resources currently integrated with the system. We also use the term to refer to the mechanisms available for integrating modules into GATE. This process has been automated to a large degree and can be driven from the interface.</Paragraph> <Paragraph position="9"> The developer is required to produce some C++ or Tcl code that uses the GDM TIPSTER API to get information from the database and write back results. When the module pre-dates integration, this is called a wrapper as it encapsulates the module in a standard form that GATE expects. When modules are developed specifically for GATE they can embed TIPSTER calls throughout their code and dispense with the wrapper intermediary.
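As an illustration, here is a minimal sketch of such a wrapper in Tcl. The gdm_* procedures and the external my_tagger program are hypothetical stand-ins, written to match the description above rather than GATE's actual API.

# Hypothetical CREOLE wrapper sketch. The gdm_* procedures stand in for
# the real GDM TIPSTER API calls and are assumptions, as is my_tagger.
proc pos_tagger_wrapper {doc} {
    # Precondition: token annotations must already be in the GDM database
    foreach tok [gdm_get_annotations $doc token] {
        lassign $tok start end              ;# byte offsets into the raw text
        set word [gdm_get_span $doc $start $end]
        # Run the pre-existing external tagger on the span
        set tag [exec ./my_tagger $word]
        # Postcondition: write a pos annotation back via the API
        gdm_put_annotation $doc pos $start $end [list category $tag]
    }
}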
The underlying module can be an external executable written in any language (the current CREOLE set includes Prolog, Lisp and Perl programs, for example).</Paragraph> <Paragraph position="10"> CREOLE wrappers encapsulate information about the preconditions for a module to run (data that must be present in the GDM database) and post-conditions (data that will result). This information is needed by GGI, and is provided by the developer in a configuration file, which also details what sort of viewer to use for the module's results and any parameters that need passing to the module. These parameters can be changed from the interface at run-time, e.g. to tell a parser to use a different lexicon. Aside from the information needed for GGI to provide access to a module, GATE compatibility equals TIPSTER compatibility - i.e. there will be very little overhead in making any TIPSTER module run in GATE.</Paragraph> <Paragraph position="11"> Given an integrated module, all other interface functions happen automatically. For example, the module will appear in a graph of all modules available, with permissible links to other modules automatically displayed, having been derived from the module pre- and post-conditions. At any point the developer can create a new graph from a subset of available CREOLE modules to perform a task of specific interest.</Paragraph> <Paragraph position="12"> The integration mechanisms also reduce the documentation load: users can reference the TIPSTER API to describe the interchange format of the data they produce and the GATE documentation for integration details. Of course GATE doesn't solve all the problems involved in plugging diverse LE modules together. There are three barriers to such integration:
* managing storage and exchange of information about texts;
* incompatibility of representation of information about texts;
* incompatibility of type of information used and produced by different modules.</Paragraph> <Paragraph position="13"> GATE provides a solution to the first two of these, allowing the integrator to concentrate on the core issue of the meaningful content of the information exchanged.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 GATE - practicalities </SectionTitle> <Paragraph position="0"> A main purpose of GGI is to allow execution of the modules within GATE and to provide a graphical access point to the results they produce. Section 3.1 describes the meaning of the primitives in the graph and how it is executed, section 3.2 describes the method used to autogenerate the graph, section 3.3 discusses the method of creating manageable subgraphs, and section 3.4 discusses results visualisation facilities.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Graph Syntax and Semantics </SectionTitle> <Paragraph position="0"> An example of a system graph is shown in figure 1. A system graph is an executable graph, and is
a simple data flow program.
2These and other screen dumps below look better in colour! The description below will be a bit like the TV snooker commentator who said &quot;For those of you watching in black and white, the pink is behind the blue&quot;.
Modules are shown as nodes in the graph, with the data flow indicated by the arcs. Each incoming arc to a module indicates a dependency on results of previous processing. All modules at the source of arcs connecting to a dependent module must be run before the dependent module is executed, except where the incoming arcs are connected by lines, in which case the module requires the execution of only one of the modules at the other end of the arc (these arcs are then termed or-arcs). Thus, in the example graph of figure 1, the buChart Parser module may only be run if the results of the Gazetteer Lookup module and either the Tagged Morph module or the Morph module are available. They in turn have earlier dependencies.</Paragraph> <Paragraph position="1"> The Tokenizer module has no dependencies and so begins execution. There are two modules with no downstream children: MUC-6 Results and MUC-6 NE Results, so either of these must produce an end result. However, because results from modules in the middle of the graph may be of interest to an NLP researcher, any module can be chosen as the final one that will be executed.</Paragraph> <Paragraph position="2"> At any point in time, the state of execution of the system, or, more accurately, the availability of data from various modules, is depicted through colour-coding of the module boxes. Figure 1 shows a system window. Light grey modules (green, in the real display) can be executed. Modules that require input from others not yet executed, and so cannot be executed yet, are shown with a white background (amber, in reality). The modules that have already been executed are shown in dark grey (red), at which point their results are available from a menu associated with each box (see below).</Paragraph> <Paragraph position="3"> The system graph can either be run in batch mode or in an interactive manner. To run in batch mode, the user selects a path through the graph and clicks on the final module. The current state of the graph, and the document (or collection of documents) currently undergoing execution, are shown. The system ensures that the path chosen by the user is valid by only allowing a module to be selected if all its inputs have already been selected. Selected modules are executed in a data driven manner, with modules being executed as soon as their input data is available.</Paragraph> <Paragraph position="4"> The interactive mode is designed for module developers. The modules under development can be executed as with the batch mode; then the module or modules to be retried (after the underlying code or resources have been changed) can be reset by a mouse click. This clears the database of the post-condition annotations and allows the modules to be rerun.</Paragraph> <Paragraph position="5"> The nature of the database (where each module produces a specific set of annotation types) means that it is possible to view partial results of execution without recourse to buffering intermediate data (Woodruff and Stonebraker, 1995).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Autogeneration </SectionTitle> <Paragraph position="0"> The graph shown in figure 1 is in fact the custom graph. This is the system graph that shows all the modules in the particular GATE environment.</Paragraph> <Paragraph position="1"> The custom window is automatically generated from the configuration information that is associated with each module, e.g., for the buChart module:
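(The configuration listing itself appears as a figure in the original paper and did not survive extraction; the following Tcl array entry is an illustrative reconstruction based on the surrounding description — the key names and annotation types are assumptions.)

# Illustrative reconstruction of a CREOLE configuration entry; key names
# and annotation types are assumptions based on the text, not the real file.
set creole_config(buChart) {
    title           "buChart Parser"
    language        english
    pre_conditions  {token sentence morph lookup}
    post_conditions {name syntax semantics}
    viewers         {name single_span syntax tree semantics raw}
    parameters      {lexicon default.lex}
}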
This data structure (actually a Tcl array (Ousterhout, 1994)) describes the TIPSTER objects that a module requires to run, the objects it produces, and the types of viewers to use for visualising its results. Along with code that uses the TIPSTER API to get information from the database and to store results back there, this configuration file is all that an integrator need produce to connect a module to GATE. Typically the biggest overhead here is converting pre-existing modules to preserve byte-offset information. See (Cunningham et al., 1996) for details.</Paragraph> <Paragraph position="2"> The autogeneration algorithm creates data flow arcs from modules that have an annotation type in their postconditions to the other modules that have the same annotation type in their preconditions (a rough sketch of this matching appears at the end of this subsection). For example, Gazetteer Lookup has the annotation type lookup in its postconditions, so an arc connects it with buChart Parser, which has that annotation type in its preconditions. Arcs are not created between modules that operate on different languages; in figure 1, however, all the modules operate on English language documents. When more than one module has the same annotation type in its postcondition it is assumed that either module may produce the required result, and so the two arcs are or-arcs and are connected by a line (both Morph and Tagged Morph produce the same annotation and so have or-arcs into buChart Parser).</Paragraph> <Paragraph position="3"> The most computationally expensive part of autogeneration goes into discarding redundant arcs. Redundant arcs are those that connect an upstream module to a downstream module where it can be deduced that the preconditions of modules between the two given modules cover the annotation types that the arc represents. For example, the Tokenizer produces annotation types required by buChart Parser, but there is no need for a data flow arc between these modules as modules between them also require these annotation types.</Paragraph> <Paragraph position="4"> The autogeneration facility allows easy integration of new modules into GGI. Most NLP tasks can be expressed in the simple data flow techniques of this system, but it is currently not possible to integrate NLP tasks that require iteration.</Paragraph> <Paragraph position="5"> Some modules have the same annotation type in both pre- and postconditions. These modify the result of previous computation and pass the data flow downstream. This kind of module, termed a filter, cannot be automatically positioned in the diagram; instead the user selects the position of filters from the arcs on which they may appear (arcs from modules that produce the annotation type the filter operates on). During execution filters are treated as normal modules.</Paragraph> </Section>
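The postcondition-to-precondition matching at the heart of autogeneration can be sketched as follows; the pre, post and lang accessors are hypothetical stand-ins for reads of the per-module configuration data described above.

# Sketch of data flow arc autogeneration: connect module A to module B
# whenever A produces an annotation type that B requires, languages
# permitting. The pre/post/lang accessors are illustrative.
proc autogenerate_arcs {modules} {
    set arcs {}
    foreach a $modules {
        foreach b $modules {
            if {$a eq $b} continue
            if {[lang $a] ne [lang $b]} continue   ;# no arcs across languages
            foreach type [post $a] {
                if {$type in [pre $b]} {
                    lappend arcs [list $a $b $type]
                }
            }
        }
    }
    # Arcs sharing a destination and type become or-arcs; redundant arcs
    # (whose types are already covered by intervening modules) are discarded
    # in a separate, more expensive pass.
    return $arcs
}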
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Customising Graphs </SectionTitle> <Paragraph position="0"> The system graphs are displayed with a graph drawing tool which is also used in the tree-based visualisation tools available for the display of e.g. syntactic parse results. This tool allows commands to be associated with nodes, hence it can be used for data flow graphs. It has a layout algorithm based on the method used by daVinci (Fröhlich and Werner, 1995) to minimise arc crossing.</Paragraph> <Paragraph position="1"> GGI suffers from the scaling problem (Burnett et al., 1987), as the size of the custom graph quickly becomes unmanageable. This can be alleviated by creating new system graphs from specified subgraphs of the custom graph. A later release will allow collapsing of graph sections.</Paragraph> <Paragraph position="2"> It is possible to group these derived system graphs together so that the user may choose from a selection of tasks at the top level of GGI (not shown here for space reasons). Having chosen a task (e.g. parsing), an intermediate level display appears, presenting the user with a selection of icons, one for each of the one or more specific systems capable of performing the selected task (e.g. the buChart parser or the Plink parser). Once a particular system is selected, a final window appears displaying the appropriate system graph.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Visualising Results </SectionTitle> <Paragraph position="0"> NLP data is wide ranging in scope but has specific characteristics that mean the problems with visualising large amounts of data (Burnett et al., 1987) are less significant. This is because either the information can be visualised as coloured markup on the text (meaning that the text can be displayed using traditional textual techniques (Jonassen, 1982)), or the information is grouped over small segments of text, such as paragraphs or sentences.</Paragraph> <Paragraph position="1"> GGI has several viewers for the display of TIPSTER annotations. The viewer for each postcondition annotation is specified by the module configuration file, an example of which is given in section 3.2. The viewers can be classified into those which display the text and overlay the annotations as colours or shades ('single span', 'multiple span', 'text-attribute'), and those that visualise a more complex relationship between annotations in an acyclic graph format ('tree'). Where no viewer is specified, a default annotation dump is displayed.</Paragraph> <Paragraph position="2"> The configuration file for the buChart Parser module in section 3.2 specifies that the 'name' annotation type is assigned the 'single span' viewer, 'syntax' the 'tree' viewer, and 'semantics' the 'raw' or annotation dump viewer. New viewers can be written where the default ones are not appropriate for new annotation types.</Paragraph> <Paragraph position="3"> The 'single span' and 'text-attribute' viewers are fairly simple, assigning different colours to each annotation. 'multiple span' is more complex, as it is designed to view annotation chains. An annotation chain is a list of annotations specified by annotation references (a small sketch appears at the end of this subsection). The user chooses a highlighted part of the text, and all the other highlights that are part of the same chain are displayed.
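To make the referential model behind annotation chains concrete, here is a small Tcl sketch of how a chain like the 'Richard Bartlett' coreference example might be represented: each annotation records byte offsets back into the unmodified text, and the chain annotation simply lists annotation ids. The layout and offsets are our own illustrative assumptions, not the actual TIPSTER data format.

# Illustrative (hypothetical) TIPSTER-style annotations: spans are byte
# offsets into the source text; the chain is a list of annotation ids.
set ann(1) {type name start 112 end 127 class person}   ;# "Richard Bartlett"
set ann(2) {type pronoun start 201 end 203}             ;# a later "he"
set ann(3) {type coref chain {1 2}}                     ;# the coreference chain

# A 'multiple span' viewer follows the chain and highlights every span:
foreach id [dict get $ann(3) chain] {
    puts "highlight span [dict get $ann($id) start]-[dict get $ann($id) end]"
}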
Figure 2 shows this viewer displaying the results of a coreference task.</Paragraph> <Paragraph position="4"> Coreference identifies elements of the text that are interpreted as referring to the same real world entity. For example, a person and a pronoun might be coreferential. In figure 2 the user has chosen one of the highlights referring to 'Richard Bartlett'.</Paragraph> <Paragraph position="5"> The 'tree' viewer containing 'syntax' annotations (produced by the buChart Parser) is shown in figure 3. The parse trees currently integrated into GATE span at most a sentence, so that the tree size is always manageable.</Paragraph> <Paragraph position="6"> The viewers are activated by first clicking with the mouse on a module whose results are present (i.e. it has been executed and its box has turned red), which reveals a menu of annotations; choosing an annotation brings up the appropriate viewer.</Paragraph> <Paragraph position="7"> There is a certain amount of connectivity between these viewers, as it is possible to click on a node in the parse tree and have the area of text highlighted in a text display window, or to highlight areas of text and display the raw annotations that are contained within the highlighted span.</Paragraph>
</Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 3.5 GATE Users </SectionTitle> <Paragraph position="0"> GATE version 1 was released in November 1996 and is in use for a number of projects around the world - see for example the SVENSK project (http://www.sics.se/humle/projects/svensk/svensk.html), which evaluated the system relative to ALEP, and http://www.dcs.shef.ac.uk/research/groups/nlp/gate/users.html. Figure 4 lists the sites that have licensed the system so far.</Paragraph> <Paragraph position="1"> 4 Word sense tagging
We have recently implemented a sense tagger within the GATE framework.</Paragraph> <Paragraph position="2"> Sense tagging is the process of assigning the appropriate sense from some semantic lexicon to each word3 in a text. This is similar to the more widely known technology of part-of-speech tagging, but the tags which are assigned in sense tagging are semantic tags from a dictionary rather than the grammatical tags assigned by a part-of-speech tagger.</Paragraph> <Paragraph position="3"> Our sense tagger uses the machine readable version of the Longman Dictionary of Contemporary English (LDOCE) (Procter, 1978) to provide the semantic tag set. LDOCE is a learners' dictionary, one designed not for native speakers of English but for those learning English as a second language, and has been used extensively in machine readable dictionary research (Ide and Veronis, 1994; Cowie, Guthrie, and Guthrie, 1992; Bruce and Wiebe, 1994).</Paragraph> <Paragraph position="4"> 3This is often loosened to each content word.</Paragraph> <Paragraph position="5"> The clearest way to understand what a sense tagger does is to look at an example of the output we would like it to produce. Consider the sentence &quot;The interest on my bank account accrued over the years&quot;; our tagger should assign a single sense from LDOCE to each of the content words in the sentence. The choice of senses in the assignment should be the same as that which a human would choose. An example of a desired assignment is shown in figure 5.</Paragraph> <Paragraph position="6"> As can be seen from the senses assigned, each LDOCE sense has a homograph and sense number; these are used to identify different levels of semantic distinction between senses and act as identifying markers. Homograph distinctions signify broad semantic differences between senses (such as the 'edge of river' and 'financial institution' senses of bank) while sense distinctions signify differences between senses which are more related (such as the 'building' and 'company' senses of the word). These numbers are followed by the textual definition of the sense and, possibly, by an example sentence which is a particular use of the sense and is printed in this type4. The information provided by these tags is potentially valuable for downstream tasks in a language processing system. For example, the system could benefit from knowing that &quot;bank&quot; in this case means</Paragraph> <Paragraph position="7"> 4LDOCE senses have additional information such as subject categories, subcategorisation information and selectional restrictions which we do not show here.</Paragraph> <Paragraph position="9"> senses in texts. A natural extension to this observation is to create a disambiguation system which makes use of several of these independent knowledge sources and combines their results in an intelligent way.</Paragraph> <Paragraph position="10"> Our system is based on a set of partial taggers, each of which uses a different knowledge source, with their results being combined. Our system is in the tradition of McRoy (McRoy, 1992), who also made use of several knowledge sources for word sense disambiguation, although the information sources she used were not independent, making it difficult to evaluate the contribution of each component. Our system makes use of strictly independent knowledge sources and is implemented within GATE, whose plug-and-play architecture makes the evaluation of individual components more straightforward.</Paragraph> <Paragraph position="11"> At the moment the sense tagger consists of six stages (shown in figure 6): the first two preprocess the text which is to be disambiguated while the remaining four carry out the disambiguation.</Paragraph> <Paragraph position="12"> 1. The text is first processed by a named-entity identifier, which we developed as part of Sheffield's entry for MUC-6 (Wakao, Gaizauskas, and Humphries, 1996; Gaizauskas et al., 1996). This identifies certain forms of proper names in the text and classifies them as either place, person, organization or location.</Paragraph> <Paragraph position="13"> For details of the classification scheme see (Def, 1995). We make no use of these classifications at present; however, they are of potential use to a module carrying out disambiguation using selectional restrictions.</Paragraph> <Paragraph position="14"> The tagger does not attempt to disambiguate any words which are identified as part of a named-entity.</Paragraph> <Paragraph position="15"> 2. The remaining text is stemmed, leaving only morphological roots, and split into sentences.</Paragraph> <Paragraph position="16"> Then words belonging to a list of stop words5 are removed.
The words which have not been identified as part of a named entity or removed as stop words are considered by the system to be ambiguous words, and those are the words which are disambiguated.</Paragraph> <Paragraph position="17"> For each of the ambiguous words, its set of possible senses is extracted from LDOCE and stored. Each sense in LDOCE contains a short textual definition (such as those shown in figure 5) which, when extracted from the dictionary, is processed to remove stop words and stem the remaining words.</Paragraph> <Paragraph position="18"> 3. The text is tagged using the Brill tagger (Brill, 1992) and a translation is carried out using a manually defined mapping from the syntactic tags assigned by Brill (Penn Tree Bank tags (Marcus, Santorini, and Marcinkiewicz, 1993)) onto the simpler part-of-speech categories associated with LDOCE senses6. We then remove from consideration any of the senses whose part-of-speech is not consistent with the one assigned by the tagger. If none of the senses are consistent with the part-of-speech we assume the tagger has made an error and leave the set of senses for that word unaltered.</Paragraph> <Paragraph position="19"> 4. Our next module is based on a proposal by Lesk (Lesk, 1986) that words in a sentence could be disambiguated by choosing the sense which produced the maximum overlap of the content words in the textual definitions of the words' senses. In practice this led to massive computations, with as many as 10^10 possible combinations of senses for a single sentence.</Paragraph> <Paragraph position="20"> Cowie et al. (Cowie, Guthrie, and Guthrie, 1992) used simulated annealing (Kirkpatrick, Gelatt, and Vecchi, 1983), a numerical optimisation algorithm, to make this process tractable (a rough sketch of the overlap score appears after this list). The simulated annealing algorithm proceeds by disambiguating a sentence at a time. A random configuration of senses is chosen such that one sense is assigned to each ambiguous word in the sentence. A score is given to this configuration based on the number of content words which are shared between the textual definitions of the senses. Other, random, configurations are then generated and the simulated annealing algorithm is used to optimise over them. When this process is complete the algorithm returns a configuration which assigns the optimal configuration of senses based on the overlap of words in the definition text.</Paragraph> <Paragraph position="21"> This process identifies a single candidate LDOCE sense for each ambiguous word.</Paragraph> <Paragraph position="22"> 5. The text is then run through a module which optimises the overlap of domain codes for the senses of nouns in each paragraph of the text.</Paragraph> <Paragraph position="23"> The optimisation algorithm used is similar to simulated annealing (see step 4), although it has been modified in two ways. Firstly, we maximise the overlap of the pragmatic codes associated with the word senses rather than the content words in their definitions. Secondly, we optimise over entire paragraphs at a time rather than just sentences; this is done because there is good evidence (Gale, Church, and Yarowsky, 1992) that a wide context, of around 100 words, is optimal when disambiguating using domain codes. This process, like the previous module, identifies a single candidate sense for each ambiguous word.</Paragraph> <Paragraph position="24"> 6. The final stage is to combine the results of the preceding processes. This is done using a very simple mechanism (also sketched after this list) which we plan to replace with an optimisation algorithm. We assign a score to each of the senses of the ambiguous words. These scores are initialised to 0 and +1 is added to a sense's score for each of the simulated annealing or pragmatic code modules which select that sense. The sense with the highest score is chosen as the tag for each ambiguous word. If there is a tie (two senses with the same score, which will happen if the two partial taggers disagree) it is broken by choosing the first sense, as listed in the dictionary. This is a sensible tie-breaker since the senses are roughly ordered by frequency of occurrence in text7. After this process is completed every ambiguous word has exactly one sense from LDOCE associated with it; this sense is the tag which our system has assigned to that word.</Paragraph>
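To make steps 4 and 6 concrete, the following Tcl sketch shows the two scoring computations involved: the definition-overlap score that simulated annealing optimises, and the vote-counting combination with its first-sense tie-break. The data layouts and procedure names are our own illustrative assumptions rather than the system's actual code.

# Overlap score for a configuration (step 4): a configuration assigns one
# sense to every ambiguous word in the sentence; its score counts content
# words shared between the chosen senses' stemmed definitions.
proc overlap_score {defs} {
    set counts {}
    foreach def $defs {                     ;# def = definition words of one chosen sense
        foreach w $def { dict incr counts $w }
    }
    set score 0
    dict for {w n} $counts { incr score [expr {$n - 1}] }
    return $score
}

# Combination (step 6): votes is assumed to be a dict mapping each sense
# to the number of partial taggers that selected it; ties fall back to the
# first sense listed in LDOCE.
proc combine {senses votes} {
    set best [lindex $senses 0]             ;# first-listed sense wins ties
    set bestScore [dict get $votes $best]
    foreach s [lrange $senses 1 end] {
        if {[dict get $votes $s] > $bestScore} {
            set best $s
            set bestScore [dict get $votes $s]
        }
    }
    return $best
}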
<Paragraph position="25"> We have conducted some preliminary testing of our tagger: our tests were run on 14 hand-disambiguated (by one of the authors) sentences from the Wall Street Journal, amounting to a 250 word corpus. We found that, of the tokens with more than 1 homograph in LDOCE, 92% were assigned the correct homograph and 75% the correct sense using our tagger. These figures should be compared to the 72% correct homograph assignment and 47% correct sense assignment reported by Cowie et al. (Cowie, Guthrie, and Guthrie, 1992) using simulated annealing alone on the same test set.</Paragraph>
6 Developing the tagger with GATE
The sense tagger was implemented as a set of 11 CREOLE modules, 6 of which had been implemented as part of VIE and the remaining 5 were developed specifically for the sense tagger. These were implemented in a variety of programming languages: C[++], Perl and Prolog. These five modules are varied in their implementation methods. Two are written entirely in C++ and are linked with the GATE executable at runtime using GATE's dynamic loading facility (see (Cunningham et al., 1996)). Three are made up of a variety of Perl scripts, Prolog saved states or C executables, which are run as external processes via GATE's Tcl (Ousterhout, 1994) API. This is typical of systems we have seen built using GATE, and illustrates its flexibility with respect to implementation options.</Paragraph> <Paragraph position="26"> The GATE graphical representation of the sense tagger is shown in figure 7.</Paragraph> <Paragraph position="27"> A special viewer was implemented within GATE to display the results of the sense tagging process. After the final module in the tagger has been run it is possible to call a viewer which displays the text which has been processed with the ambiguous words highlighted (see figure 8). Clicking on one of these highlighted words causes another window to appear which contains the sense which has been assigned to that word by the tagger (see figure 9). Using this viewer we can quickly see that the tagger has assigned the 'chosen for job' sense of &quot;appointment&quot; in &quot;Kando, whose appointment takes effect from today ...&quot;, which is the correct sense in this context.</Paragraph>
7We are using the 1st Edition of LDOCE, in which the publishers make no claim that the senses are ordered by frequency of occurrence in text (although they do in later editions).
However, (Guo, 1989) has found evidence that there is a correspondence between the order in which senses are listed and the frequency of occurrence.</Paragraph> <Paragraph position="28"> Implementing the tagger within GATE allowed the rapid re-use of existing modules and reduced the need to provide data-transfer routes between modules. Almost the entire preprocessing of the text was carried out using modules which had already been implemented within GATE: the tokeniser, sentence splitter, Brill part-of-speech tagger and the modules which made up the Named Entity identifier. This meant that we could quickly implement the modules which carried out the disambiguation, and those were the modules in which we were most interested. The implementation was further speeded up by the use of results viewers, which allowed us to examine the annotations in the TIPSTER database after a module had been run, allowing us to discover bugs far more quickly than would have been possible in a system which is not as explicitly modular as GATE. One aspect of sense tagging in which we are interested is the effect of including and excluding different modules, and such experiments could be easily carried out using GGI.</Paragraph> <Paragraph position="29"> One particular limitation of the current GATE implementation became apparent during this work, viz. the necessity of cascading module resets in the presence of non-monotonic database updates. For example, the POS filter modules remove some of the sense definitions associated with words by the lexical pre-processing stages. When resetting these modules it is therefore necessary to reset the preprocessor stage in order that the database is returned to a consistent state (this is done automatically by GATE, which identifies cases where modules alter previously existing annotations by examination of the pre-/postconditions of the module supplied by the developer as configuration information prior to loading). This leads to redundant processing, and in the case of slow modules (like our LDOCE lookup module) this can be an appreciable brake on the development cycle. The planned solution is to change the implementation of the reset function. Currently this simply deletes the database objects created by a module. Given a database implementation that supports transactions, we can use timestamp and rollback for a more intelligent reset, and avoid the redundant processing caused by reset cascading.</Paragraph> <Paragraph position="30"> An additional, lesser problem is the complexity of the generation algorithms for the task graphs, and the difficulty of managing these graphs as the number of modules in the system grows. The graphs currently make two main contributions to the system: they give a graphical representation of control flow and allow the user to manipulate the execution of modules, and they give a graphical entry point to results visualisation. These benefits will have to be balanced against their disadvantages in future versions of the system. Another problem may arise when the architecture includes facilities for distributed processing (Zajac et al., 1997; Zajac, 1997), as it is not obvious how the linear model currently embodied in the graphs could be extended to support non-linear control structures.</Paragraph> </Section> </Paper>