<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0801">
  <Title>The Talent System: TEXTRACT Architecture and Data Model</Title>
  <Section position="2" start_page="10598" end_page="10598" type="metho">
    <SectionTitle>
3 Different Operational Environments
</SectionTitle>
    <Paragraph position="0"> For the purposes of interactive (re-)configuration of TEXTRACT's processing chain, rapid application prototyping, and incremental plugin functionality development, the system's underlying infrastructure capabilities are available to a graphical interface. This allows control over individual plugins; in particular, it exploits the configuration object to dynamically reconfigure specified plugins on demand. By exposing access to the common analysis substrate and the document object, and by exploiting a mechanism for declaring, and interpreting, dependencies among individual plugins, the interface further offers functionality similar to that of GATE (Cunningham, 2002). Such functionality is facilitated by suitable annotation repository methods, including a provision for 'rolling back' the repository to an earlier state, without a complete system reInit().</Paragraph>
  </Section>
  <Section position="3" start_page="10598" end_page="10598" type="metho">
    <SectionTitle>
4 The TEXTRACT Data Model
</SectionTitle>
    <Paragraph position="0"> The plugins and applications communicate via the annotations, the vocabulary, and the lexical cache. The collection object owns the lexical cache; the document object contains the other two subsystems: the annotation repository and the document vocabulary. Shared read-only resources are managed by the resource manager.</Paragraph>
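This ownership structure can be sketched roughly as follows; the class names and layout here are illustrative assumptions, not TEXTRACT's actual API:

```cpp
#include <memory>
#include <vector>

// Hypothetical sketch of the ownership described above: the collection
// owns the lexical cache; each document owns its annotation repository
// and vocabulary. Names are illustrative, not TEXTRACT's actual API.
struct LexicalCache {};          // shared across the whole collection
struct AnnotationRepository {};  // per-document annotations
struct Vocabulary {};            // per-document names/terms/abbreviations

struct Document {
    AnnotationRepository annotations;
    Vocabulary vocabulary;
};

struct Collection {
    LexicalCache lexCache;                      // collection-lifetime cache
    std::vector<std::unique_ptr<Document>> docs;

    Document& addDocument() {
        docs.push_back(std::make_unique<Document>());
        return *docs.back();
    }
};
```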
    <Paragraph position="1"> Annotations: Annotations contain, minimally, the character locations of the beginning and ending position of the annotated text within the base document, along with the type of the annotation. Types are organized into families: lexical, syntactic, document structure, discourse, and markup. The markup family provides access to the text buffer, and is generally used only by the tokenizer. The annotation repository owns the type system and pre-populates it at startup time. Annotation features vary according to the type; for example, position in a hierarchy of vocabulary categories (e.g. Person, Org) is a feature of lexical annotations. New types and features (but not new families) can be added dynamically by any system component. The annotation repository has a container of annotations ordered on start location (ascending), end location (descending), priority of type family (descending), priority within type family (descending), and type name (ascending). The general effect of the family and type priority order is to reflect nesting level in cases where there are multiple annotations at different levels with the same span. With this priority, an annotation iterator will always return an NP (noun phrase) annotation before a covered word annotation, no matter how many words are in the NP. In addition, the GUI is configurable as a development environment for finite state (FS) grammar writing and debugging, offering native grammar editing and compilation, contextualized visualization of FS matching, and in-process inspection of the annotation repository at an arbitrary level of granularity. Figure 2 is broadly indicative of some of the functional components exposed: in particular, it exemplifies a working context for a grammar writer, which includes an interface for setting operational parameters, a grammar editor/compiler, and multiple viewers for the results of the pattern match, mediated via the annotation repository and making use of different presentation perspectives (e.g. a parse tree for structural analysis, a concordance for pattern matching, and so forth).</Paragraph>
    <Paragraph position="2"> Iterators over annotations can move forward and backward with respect to this general order. Iterators can be filtered by a set of annotation families, by types, or by a specified text location. A particular kind of filtered iterator is the subiterator, an iterator that covers the span of a given annotation (leaving out the given annotation itself).</Paragraph>
    <Paragraph position="3"> Iterators can be specified to be &amp;quot;ambiguous&amp;quot; or &amp;quot;unambiguous.&amp;quot; Ambiguous scans return all the annotations encountered; unambiguous scans return only a single annotation covering each position in the document, the choice being made according to the sort order above.</Paragraph>
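An unambiguous scan over a repository-ordered list amounts to greedily taking the first annotation in sort order at each uncovered position; a sketch, with illustrative types:

```cpp
#include <string>
#include <vector>

// Sketch of an "unambiguous" scan over a repository-ordered annotation
// list (types and fields are illustrative): each document position is
// covered by at most one returned annotation, namely the first one
// encountered in the sort order.
struct Ann { int start, end; std::string type; };

std::vector<Ann> unambiguousScan(const std::vector<Ann>& ordered) {
    std::vector<Ann> out;
    int covered = -1;                   // rightmost position already covered
    for (const Ann& a : ordered) {
        if (a.start > covered) {        // skip annotations under a chosen span
            out.push_back(a);
            covered = a.end;
        }
    }
    return out;
}
```

An ambiguous scan would simply return `ordered` unchanged.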
    <Paragraph position="4"> Unambiguous scans within a family are most useful for retrieving just the highest order of analysis. All the different kinds of filters can be specified in any combination. Lexical Cache: One of the features on a word annotation is a reference to an entry in the lexical cache.</Paragraph>
    <Paragraph position="5"> The cache contains one entry for each unique token in the text that contains at least one alphabetic character.</Paragraph>
    <Paragraph position="6"> Initially designed to improve performance of lexical lookup, the cache has become a central location for authority information about tokens, whatever the source: lexicon, stop word list, gazetteer, tagger model etc. The default lifetime of the lexical cache is the collection; however, performance can be traded for memory by a periodic cache refresh.</Paragraph>
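The caching policy just described (one entry per unique token containing at least one alphabetic character, accumulating authority information from any source) might look roughly like this; the entry fields and method names are assumptions for illustration:

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical sketch of the lexical cache policy described above: one
// entry per unique alphabetic-bearing token, serving as the central
// location for authority information. Fields are illustrative only.
struct LexEntry {
    std::string lemma;        // filled in by lexical lookup
    bool stopWord = false;    // authority info accumulates here
};

struct LexicalCache {
    std::map<std::string, LexEntry> entries;

    // Returns the entry for a token, creating it on first sight; tokens
    // with no alphabetic character (numbers, punctuation) are not cached.
    LexEntry* lookup(const std::string& token) {
        bool hasAlpha = false;
        for (unsigned char c : token)
            if (std::isalpha(c)) { hasAlpha = true; break; }
        if (!hasAlpha) return nullptr;
        return &entries[token];   // std::map nodes are pointer-stable
    }
};
```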
    <Paragraph position="7"> The lexical lookup (lexalyzer) plugin populates the lexical cache with tokens, their lemma forms, and morpho-syntactic features. Morpho-syntactic features are encoded in an interchange format which mediates among notations of different granularities (of syntactic feature distinctions or morphological ambiguity), used by dictionaries (we use the IBM LanguageWare dictionaries, available for over 30 languages), tag sets, and finite state grammar symbols. In principle, different plugins running together can use different tag sets by defining appropriate tagset mapping tables via a configuration file. Similarly, a different grammar morpho-syntactic symbol set can also be externally defined. As with annotations, an arbitrary number of additional features can be specified, on the fly, for tokens and/or lemma forms. For example, an indexer for domain terminology cross-references different spellings, as well as misspellings, of the same term. The API to the lexical cache also provides an automatic pass-through to the dictionary API, so that any plugin can look up a string that is not in the text and have it placed in the cache.</Paragraph>
    <Paragraph position="8"> Vocabulary: Vocabulary annotations (names, domain terms, abbreviations) have a reference to an entry in the vocabulary. The canonical forms, variants, and categories in the vocabulary can be plugin-discovered (Nominator), or plugin-recovered (matched from an authority resource, such as a glossary). Collection salience statistics (e.g. tfxidf), needed, for example, by the summarizer application, are populated from a resource derived from an earlier collection run. As with the annotations and lexical entries, a plugin may define new features on the fly.</Paragraph>
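The tfxidf salience score mentioned here is the standard term-frequency times inverse-document-frequency weighting; a minimal sketch of one common formulation (the exact variant TEXTRACT uses is not specified in the text):

```cpp
#include <cmath>

// One common tf x idf formulation, shown for illustration only; the
// precise variant TEXTRACT derives from its collection run is not
// specified in the text.
double tfidf(int termFreq, int docCount, int docsContainingTerm) {
    if (termFreq == 0 || docsContainingTerm == 0) return 0.0;
    return termFreq *
           std::log(static_cast<double>(docCount) / docsContainingTerm);
}
```

A term appearing in every document scores zero; a frequent term concentrated in few documents scores high.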
    <Paragraph position="9"> Resource Manager: The Resource Manager, implemented as a C++ singleton object so as to be available to any component anywhere, manages the files and APIs of an eclectic collection of shared read-only resources: a names authority database (gazetteer), prefix and suffix lists, stop word lists, the IBM LanguageWare dictionaries with their many functions (lemmatization, morphological lookup, synonyms, spelling verification, and spelling correction), and, for use in the research environment, WordNet (Fellbaum, 1998). The API wrappers for the resources are deliberately not uniform, to allow rapid absorption and reuse of components. For performance, the results of lookup in these resources are cached as features in the lexical cache or vocabulary.</Paragraph>
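A minimal sketch of the C++ singleton pattern in the style described above; the resource registry shown is an illustrative assumption, not TEXTRACT's actual interface:

```cpp
#include <map>
#include <string>

// Minimal C++ singleton sketch: one instance, reachable from anywhere.
// The name->path registry is illustrative, not TEXTRACT's actual API.
class ResourceManager {
public:
    static ResourceManager& instance() {
        static ResourceManager mgr;   // constructed once, on first use
        return mgr;
    }
    void registerResource(const std::string& name, const std::string& path) {
        resources_[name] = path;
    }
    const std::string* find(const std::string& name) const {
        auto it = resources_.find(name);
        return it == resources_.end() ? nullptr : &it->second;
    }
private:
    ResourceManager() = default;                      // no public construction
    ResourceManager(const ResourceManager&) = delete; // no copies
    ResourceManager& operator=(const ResourceManager&) = delete;
    std::map<std::string, std::string> resources_;
};
```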
  </Section>
  <Section position="4" start_page="10598" end_page="10598" type="metho">
    <SectionTitle>
TEXTRACT Plugins
</SectionTitle>
    <Paragraph position="0"> TEXTRACT plugins and applications need only conform to the API of the plugin manager, which cycles through the plugin vector with the methods construct(), initialize(), processDocument(), and endDocument(). Collection applications and plugins look nearly the same to the plugin manager; they additionally have startCollection() and endCollection() methods. The complete API also includes the interfaces to the annotation repository, lexical cache, and vocabulary.</Paragraph>
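The lifecycle contract can be sketched as an abstract interface plus a driver loop; the method names come from the text, while the class names and the loop itself are assumptions (construct() is modeled here by the C++ constructor):

```cpp
#include <memory>
#include <vector>

// Sketch of the plugin-manager contract described above. Method names
// follow the text; class names and the driver loop are assumptions.
struct Plugin {
    virtual ~Plugin() = default;
    virtual void initialize() {}
    virtual void processDocument() = 0;
    virtual void endDocument() {}
    // Collection-level plugins and applications additionally implement:
    virtual void startCollection() {}
    virtual void endCollection() {}
};

// A trivial plugin that counts the documents it has processed.
struct CountingPlugin : Plugin {
    int docsSeen = 0;
    void processDocument() override { ++docsSeen; }
};

// The plugin manager cycles through the plugin vector for each document.
struct PluginManager {
    std::vector<std::unique_ptr<Plugin>> plugins;

    void runCollection(int numDocs) {
        for (auto& p : plugins) { p->initialize(); p->startCollection(); }
        for (int d = 0; d < numDocs; ++d) {
            for (auto& p : plugins) p->processDocument();
            for (auto& p : plugins) p->endDocument();
        }
        for (auto& p : plugins) p->endCollection();
    }
};
```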
    <Paragraph position="1">  Numerous NLP applications today deploy finite state (FS) processing techniques--for, among other things, efficiency of processing, perspicuity of representation, rapid prototyping, and grammar reusability (see, for instance, Karttunen et al., 1996; Kornai, 1999). TEXTRACT's FS transducer plugin (henceforth TFST), encapsulates FS matching and transduction capabilities and makes these available for independent development of grammar-based linguistic filters and processors.</Paragraph>
    <Paragraph position="2"> In a pipelined architecture, and in an environment designed to facilitate and promote reusability, there are some questions about the underlying data stream over which the FS machinery operates, as well as about the mechanisms for making the infrastructure components--in particular the annotation repository and shared resources--available to the grammar writer.</Paragraph>
    <Paragraph position="3"> Given that the document character buffer logically 'disappears' from a plugin's point of view, FS operations now have to be defined over annotations and their properties. This necessitates the design of a notation, in which grammars can be written with reference to TEXTRACT's underlying data model, and which still have access to the full complement of methods for manipulating annotations.</Paragraph>
    <Paragraph position="4"> In the extreme, what is required is an environment for an FS calculus over typed feature structures (see Becker et al., 2002), with pattern-action rules where patterns would be specified over type configurations, and actions would manipulate annotation types in the annotation repository. Manipulation of annotations from FS specifications is also done in other annotation-based text processing architectures (see, for instance, the JAPE system; Cunningham et al., 2000). However, this is typically achieved, as in JAPE, by allowing for code fragments on the right-hand side of the rules.</Paragraph>
    <Paragraph position="5"> Both assumptions--that a grammar writer would be familiar with the complete type system employed by all 'upstream' (and possibly third party) plugins, and that a grammar writer would be knowledgeable enough to deploy raw API's to the annotation repository and resource manager--go against the grain of TEXTRACT's design philosophy.</Paragraph>
    <Paragraph position="6"> Consequently, we make use of an abstraction layer between an annotation representation (as it is implemented) and a set of annotation property specifications which define individual plugin capabilities and granularity of analysis. We have also developed a notation for FS operations which appeals to the system-wide set of annotation families, with their property attributes, and which encapsulates operations over annotations--such as creating new ones, removing existing ones, and modifying and/or adding properties--as primitive operations.</Paragraph>
    <Paragraph position="7"> Note that the abstraction hides from the grammar writer system-wide design decisions, which separate the annotation repository, the lexicon, and the vocabulary (see Section 3 above): thus, for instance, access to lexical resources with morpho-syntactic information, or, indeed, to external repositories like gazetteers or lexical databases, appears to the grammar writer as querying an annotation with morpho-syntactic properties and attribute values; similarly, a rule can post a new vocabulary item using notational devices identical to those for posting annotations.</Paragraph>
    <Paragraph position="8"> The freedom to define, and post, new annotation types 'on the fly' places certain requirements on the FST subsystem. In particular, it is necessary to infer how new annotations and their attributes fit into an already instantiated data model. The FST plugin therefore incorporates logic in its reInit() method which scans an FST file (itself generated by an FST compiler typically running in the background), and determines-by deferring to a symbol compiler--what new annotation types and attribute features need to be dynamically configured and incrementally added to the model.</Paragraph>
    <Paragraph position="9"> An annotation-based regime of FS matching needs a mechanism for picking a particular path through the input annotation lattice, over which a rule should be applied: thus, for instance, some grammars would inspect raw tokens, others would abstract over vocabulary items (some of which would cover multiple tokens), yet others might traffic in constituent phrasal units (with an additional constraint over phrase type) and/or document structure elements (such as section titles, sentences, and so forth).</Paragraph>
    <Paragraph position="10"> For grammars which examine uniform annotation types, it is relatively straightforward to infer, and construct (for the run-time FS interpreter), an iterator over such a type (in this example, sentences). However, expressive and powerful FS grammars may be written which inspect, at different--or even the same--points of the analysis, annotations of different types. In this case it is essential that the appropriate iterators get constructed, and composed, so that a felicitous annotation stream gets submitted to the run-time for inspection; TEXTRACT deploys a special dual-level iterator designed expressly for this purpose.</Paragraph>
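One way to picture such path-picking over two composed iterators: prefer a higher-level annotation (e.g. a multi-token vocabulary item) where one starts, and fall back to raw tokens elsewhere. The two-level scheme and all names below are assumptions for illustration, not the actual dual-level iterator:

```cpp
#include <string>
#include <vector>

// Illustrative sketch of picking one path through the annotation lattice:
// where a higher-level annotation (e.g. a vocabulary item) begins, take
// it and skip the tokens it covers; otherwise emit the raw token. Both
// inputs are assumed sorted by start position and non-overlapping within
// each level. This is an assumption-laden stand-in for TEXTRACT's
// dual-level iterator, not its actual implementation.
struct Ann { int start, end; std::string type; };

std::vector<Ann> pickPath(const std::vector<Ann>& upper,   // e.g. vocabulary items
                          const std::vector<Ann>& lower) { // e.g. tokens
    std::vector<Ann> path;
    size_t u = 0;
    int covered = -1;                       // rightmost position already emitted
    for (const Ann& tok : lower) {
        if (tok.start <= covered) continue; // token inside a chosen item
        while (u < upper.size() && upper[u].start < tok.start) ++u;
        if (u < upper.size() && upper[u].start == tok.start) {
            path.push_back(upper[u]);       // take the covering item
            covered = upper[u].end;
        } else {
            path.push_back(tok);            // no item here: raw token
            covered = tok.end;
        }
    }
    return path;
}
```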
    <Paragraph position="11"> Additional features of the TFST subsystem allow for seamless integration of character-based regular expression matching, morpho-syntactic abstraction from the underlying lexicon representation and part-of-speech tagset, composition of complex attribute specification from simple feature tests, and the ability to constrain rule application within the boundaries of specified annotation types only. This allows for the easy specification, via the grammar rules, of a variety of matching regimes which can transparently query upstream annotators of which only the externally published capabilities are known.</Paragraph>
    <Paragraph position="12"> Applications utilizing TFST include a shallow parser (Boguraev, 2000), a front end to a glossary identification tool (Park et al., 2002), a parser for temporal expressions, a named entity recognition device, and a tool for extracting hypernym relations.</Paragraph>
  </Section>
  <Section position="5" start_page="10598" end_page="10598" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> The Talent system, and TEXTRACT in particular, belongs to a family of language engineering systems which includes GATE (University of Sheffield), Alembic (MITRE Corporation), and ATLAS (University of Pennsylvania), among others. Talent is perhaps closest in spirit to GATE. In Cunningham, et al. (1997), GATE is described as &amp;quot;a software infrastructure on top of which heterogeneous NLP processing modules may be evaluated and refined individually or may be combined into larger application systems.&amp;quot; Thus, both Talent and GATE address the needs of researchers and developers, on the one hand, and of application builders, on the other.</Paragraph>
    <Paragraph position="1"> The GATE system architecture comprises three components: The GATE Document Manager (GDM), The Collection of Reusable Objects for Language Engineering (CREOLE), and the GATE Graphical Interface (GGI). GDM, which corresponds to TEXTRACT's driver, engine, and plugin manager, is responsible for managing the storage and transmission (via APIs) of the annotations created and manipulated by the NLP processing modules in CREOLE. In TEXTRACT's terms, the GDM is responsible for the data model kept in the document and collection objects. Second, CREOLE is the GATE component model and corresponds to the set of TEXTRACT plugins. Cunningham, et al. (1997) emphasize that CREOLE modules, which can encapsulate both algorithmic and data resources, are mainly created by wrapping preexisting code to meet the GDM APIs.</Paragraph>
    <Paragraph position="2"> In contrast, TEXTRACT plugins are typically written expressly in order that they may directly manipulate the analyses in the TEXTRACT data model. According to Cunningham, et al. (2001), available CREOLE modules include: tokenizer, lemmatizer, gazetteer and name lookup, sentence splitter, POS tagger, and a grammar application module, called JAPE, which corresponds to TEXTRACT's TFST. Finally, GATE's third component, GGI, is the graphical tool which supports configuration and invocation of GDM and CREOLE for accomplishing analysis tasks. This component is closest to TEXTRACT's graphical user interface. As discussed earlier, the GUI is used primarily as a tool for grammar development and AR inspection during grammar writing. Most application uses of TEXTRACT are accomplished with the programming APIs and configuration tools, rather than with the graphical tool.</Paragraph>
    <Paragraph position="3"> Most language engineering systems in the TEXTRACT family have been motivated by a particular set of applications: semi-automated, mixed-initiative annotation of linguistic material for corpus construction and interchange, and for NLP system creation and evaluation, particularly in machine-learning contexts.</Paragraph>
    <Paragraph position="4"> As a result, such systems generally highlight graphical user interfaces, for visualizing and manipulating annotations, and file formats, for exporting annotations to other systems. Alembic (MITRE, 1997) and ATLAS (Bird, et al., 2000) belong to this group. Alembic, built for participation in the MUC conferences and adhering to the TIPSTER API (Grishman, 1996), incorporates automated annotators (&amp;quot;plugins&amp;quot;) for word/sentence tokenization, part-of-speech tagging, person/organization/location/date recognition, and coreference analysis. It also provides a phrase rule interpreter similar to TFST. Alembic incorporates ATLAS's &amp;quot;annotation graphs&amp;quot; as its logical representation for annotations. Annotation graphs reside in &amp;quot;annotation sets,&amp;quot; which are closest in spirit to TEXTRACT's annotation repository, although they don't apparently provide APIs for fine-grained manipulation of, and filtered iterations over, the stored annotations. Rather, ATLAS exports physical representations of annotation sets as XML files or relational databases containing stand-off annotations, which may then be processed by external applications.</Paragraph>
    <Paragraph position="5"> Other systems in this genre are Anvil (Vintar and Kipp, 2001), LT-XML (Brew, et al., 2000), MATE (McKelvie, et al., 2000), and Transcriber (Barras, et al., 2001). Like ATLAS, some of these were originally built for processing speech corpora and have been extended for handling text. With the exception of GATE, all of these systems are devoted mainly to semi-automated corpus annotation and to evaluation of language technology, rather than to the construction of industrial NLP systems, which is TEXTRACT's focus.</Paragraph>
    <Paragraph position="6"> As a result, TEXTRACT uses a homogeneous implementation style for its annotation and application plugins, with a tight coupling to the underlying shared analysis data model. This is in contrast to the more loosely-coupled, heterogeneous plugin and application model used by the other systems.</Paragraph>
  </Section>
</Paper>