<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1056">
<Title>An Integrated Architecture for Shallow and Deep Processing</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle>2 Architecture</SectionTitle>
<Paragraph position="0"> The WHITEBOARD architecture defines a platform that integrates different NLP components by enriching an input document with XML annotations. XML serves as a uniform way of representing and keeping all results of the various processing components, and supports a transparent software infrastructure for LT-based applications. Some linguistic information, however, especially in DNLP, cannot be represented efficiently within the basic XML markup framework ("typed parentheses structure"); examples are coreferences, ambiguous readings, and discontinuous constituents. The WHITEBOARD architecture therefore employs a distributed multi-level representation of the different annotations. Instead of being translated into a single XML document, complex structures are stored in separate annotation layers (possibly non-XML, e.g., feature structures). Hyperlinks and "span" information together support efficient access between layers. Linguistic information of common interest (e.g., constituent structure extracted from HPSG feature structures) is available in XML format, with hyperlinks to the full feature structure representations stored externally in corresponding data files.</Paragraph>
<Paragraph position="1"> Fig. 1 gives an overview of the architecture of the WHITEBOARD Annotation Machine (WHAM).</Paragraph>
<Paragraph position="2"> Applications feed the WHAM with input texts and a specification describing the requested components and configuration options. The core WHAM engine has an XML markup storage (the external "offline" representation) and an internal "online" multi-level annotation chart (index-sequential access). Following the trichotomy of NLP data representation models in (Cunningham et al., 1997), the XML markup contains additive information, while the multi-level chart contains positional and abstraction-based information, e.g., feature structures representing NLP entities in a uniform, linguistically motivated form.</Paragraph>
<Paragraph position="3"> Applications and the integrated components access the WHAM results through an object-oriented programming (OOP) interface which is designed to be as general as possible in order to abstract from component-specific details (while preserving the shallow and deep paradigms). The interfaces of the actually integrated components form subclasses of this generic interface. New components can be integrated by implementing the interface and specifying DTDs and/or transformation rules for the chart.</Paragraph>
<Paragraph position="4"> The OOP interface consists of iterators that walk through the different annotation levels (e.g., token spans, sentences), reference and seek operators that allow switching to corresponding annotations on a different level (e.g., retrieve all tokens of the current sentence, or move to the next named entity starting from a given token position), and accessor methods that return the linguistic information contained in the chart. Similarly, general methods support navigating the type system and feature structures of the DNLP components. The resulting output of the WHAM can be accessed via the OOP interface or as XML markup.</Paragraph>
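To make the chart design concrete, the following is a minimal sketch of a span-based, multi-level annotation chart with iterator, reference, and seek access. All class and method names are invented for illustration; this is not the actual WHAM API, whose details the text only summarizes.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int                                 # character offset where the span begins
    end: int                                   # character offset where the span ends
    info: dict = field(default_factory=dict)   # e.g., POS tag, NE type, or a hyperlink to a
                                               # feature structure stored in an external file

class AnnotationChart:
    """One layer per annotation level: tokens, sentences, named entities, ..."""

    def __init__(self):
        self.layers = {}   # layer name -> list of Annotations sorted by start offset

    def add(self, layer, ann):
        anns = self.layers.setdefault(layer, [])
        anns.append(ann)
        anns.sort(key=lambda a: a.start)

    def iterate(self, layer):
        """Iterator over one annotation level (e.g., token spans)."""
        yield from self.layers.get(layer, [])

    def covered(self, layer, span):
        """Reference operator: all annotations of `layer` inside `span`,
        e.g., all tokens of the current sentence."""
        return [a for a in self.layers.get(layer, [])
                if span.start <= a.start and a.end <= span.end]

    def seek(self, layer, position):
        """Seek operator: the next annotation on `layer` starting at or
        after `position`, e.g., the next named entity after a given token."""
        anns = self.layers.get(layer, [])
        i = bisect_left([a.start for a in anns], position)
        return anns[i] if i < len(anns) else None

chart = AnnotationChart()
chart.add("sentence", Annotation(0, 48))
chart.add("token", Annotation(0, 7, {"pos": "NE", "fs": "fs0013.dat#n42"}))  # hypothetical hyperlink
tokens = chart.covered("token", Annotation(0, 48))   # all tokens of the sentence
next_ne = chart.seek("ne", 8)                        # None here: the NE layer is still empty
```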
<Paragraph position="5"> The WHAM interface operations are used not only to implement NLP component-based applications, but also for the integration of the deep and shallow processing components themselves.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle>2.1 Components</SectionTitle>
<Paragraph position="0"> Shallow analysis is performed by SPPC, a rule-based system consisting of a cascade of weighted finite-state components that carry out the successive steps of linguistic analysis: fine-grained tokenization, lexico-morphological analysis, part-of-speech filtering, named entity (NE) recognition, sentence boundary detection, and chunk and subclause recognition; see (Piskorski and Neumann, 2000; Neumann and Piskorski, 2002) for details. SPPC processes vast amounts of textual data robustly and efficiently (ca. 30,000 words per second on a standard PC). We briefly describe the SPPC components that are currently integrated with the deep components.</Paragraph>
<Paragraph position="1"> Each token identified by the tokenizer as a potential word form is morphologically analyzed. For each token, its lexical information (the list of valid readings, including stem, part of speech, and inflection information) is computed using a full-form lexicon of about 700,000 entries that was compiled from a stem lexicon of about 120,000 lemmas. After morphological processing, POS disambiguation rules are applied which compute a preferred reading for each token, while the deep components can back off to all readings. NE recognition is based on simple pattern matching techniques. Proper names (organizations, persons, locations), temporal expressions, and quantities are recognized with an average precision of almost 96% and a recall of 85%.</Paragraph>
<Paragraph position="2"> Furthermore, NE-specific reference resolution is performed through a dynamic lexicon which stores abbreviated variants of previously recognized named entities. Finally, the system splits the text into sentences by applying only a few, but highly accurate, contextual rules for filtering implausible punctuation signs. These rules benefit directly from NE recognition, which already performs restricted punctuation disambiguation.</Paragraph>
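As an illustration of the dynamic lexicon idea, here is a minimal sketch assuming two simple ways of generating abbreviated variants (the final token and an acronym); the variant heuristics and all names are invented for illustration and are not SPPC's actual rules.

```python
class DynamicNELexicon:
    """Maps shortened mentions back to previously recognized named entities."""

    def __init__(self):
        self.entries = {}   # variant string -> (full name, NE type)

    def register(self, full_name, ne_type):
        """Store the full name together with simple abbreviated variants."""
        tokens = full_name.split()
        self.entries[full_name] = (full_name, ne_type)
        self.entries[tokens[-1]] = (full_name, ne_type)      # e.g., "Machines"
        acronym = "".join(t[0] for t in tokens if t[0].isupper())
        if len(acronym) > 1:
            self.entries[acronym] = (full_name, ne_type)     # e.g., "IBM"

    def resolve(self, variant):
        """Return the stored entity for a shortened mention, if any."""
        return self.entries.get(variant)

lex = DynamicNELexicon()
lex.register("International Business Machines", "organization")
assert lex.resolve("IBM") == ("International Business Machines", "organization")
```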
<Paragraph position="3"> The HPSG grammar is based on a large-scale grammar for German (Müller, 1999), which was further developed in the VERBMOBIL project for the translation of spoken language (Müller and Kasper, 2000). After VERBMOBIL, the grammar was adapted to the requirements of the LKB/PET system (Copestake, 1999) and to written text, i.e., extended with constructions such as free relative clauses that were irrelevant in the VERBMOBIL scenario.</Paragraph>
<Paragraph position="4"> The grammar consists of a rich hierarchy of 5,069 lexical and phrasal types. The core grammar contains 23 rule schemata, 7 special verb movement rules, and 17 domain-specific rules. All rule schemata are unary or binary branching. The lexicon contains 38,549 stem entries, of which more than 70% were semi-automatically acquired from the annotated NEGRA corpus (Brants et al., 1999).</Paragraph>
<Paragraph position="5"> The grammar parses full sentences, but also other kinds of maximal projections. In cases where no full analysis of the input can be provided, analyses of fragments are handed over to subsequent modules.</Paragraph>
<Paragraph position="6"> Such fragments consist of maximal projections or single words.</Paragraph>
<Paragraph position="7"> The HPSG analysis system currently integrated in the WHITEBOARD system is PET (Callmeier, 2000). Initially, PET was built to experiment with different techniques and strategies for processing unification-based grammars. The resulting system provides efficient implementations of the best-known techniques for unification and parsing.</Paragraph>
<Paragraph position="8"> As an experimental system, the original design lacked open interfaces for flexible integration with external components. For instance, at the beginning of the WHITEBOARD project the system accepted only full-form lexica and string input. In collaboration with Ulrich Callmeier, the system was extended. Instead of single-word input, input items can now be complex, overlapping, and ambiguous, i.e., essentially word graphs. We added the dynamic creation of atomic type symbols, e.g., to be able to add arbitrary symbols to feature structures. With these enhancements, it is possible to build flexible interfaces to external components such as morphology, tokenization, and named entity recognition.</Paragraph>
</Section>
</Section>
</Paper>