<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1031"> <Title>An Information Extraction Core System for Real World German Text Processing</Title> <Section position="3" start_page="0" end_page="210" type="intro"> <SectionTitle> *DFKI GmbH, Stuhlsatzenhausweg 3, 66123 </SectionTitle> <Paragraph position="0"> thereby provide an easy means to skip text without deep analysis.</Paragraph> <Paragraph position="1"> The majority of existing information extraction systems are applied to English text. A major drawback of previous systems was their limited portability to new domains and tasks, which was also caused by the restricted re-usability of their knowledge sources. Consequently, the major goals identified during the sixth message understanding conference (MUC-6) were, on the one hand, to demonstrate task-independent component technologies of information extraction, and, on the other hand, to encourage work on increasing portability and &quot;deeper understanding&quot; (cf. (Grishman and Sundheim, 1996)).</Paragraph> <Paragraph position="2"> In this paper we report on SMES, an information extraction core system for real world German text processing. The main research topics we are concerned with are the easy portability and adaptability of the core system to extraction tasks of different complexity and to different domains. We concentrate here on the technical and implementational aspects of the IE core technology used for achieving the desired portability. 
We will only briefly describe some of the current applications built on top of this core machinery (see section 7).</Paragraph> <Paragraph position="3"> 2 The overall architecture of SMES The basic design criterion of the SMES system is to provide a set of basic powerful, robust, and efficient natural language components and generic linguistic knowledge sources which can easily be customized for processing different tasks in a flexible manner.</Paragraph> <Paragraph position="4"> Hence, we view SMES as a core information extraction system. Customization is achieved in the following directions: * defining the flow of control between modules (e.g., cascaded and/or interleaved) Figure 1 shows a blueprint of the core system (which roughly follows the design criteria of the generic information extraction system described in (Hobbs, 1992)). The main components are: A tokenizer based on regular expressions: it scans an ASCII text file, recognizing text structure, special tokens such as date and time expressions, abbreviations, and words.</Paragraph> <Paragraph position="5"> A very efficient and robust German morphological component which performs morphological inflection and compound processing. For each analyzed word it returns a (set of) triples, each containing the stem (or a list of stems in the case of a compound), the part of speech, and inflectional information. Disambiguation of the morphological output is performed by a set of word-case sensitive rules and a Brill-based unsupervised tagger.</Paragraph> <Paragraph position="6"> A declarative specification tool for expressing finite state grammars for handling word groups and phrasal entities (e.g., general NPs, PPs, or verb groups, complex time and date expressions, proper name expressions). A finite state grammar consists of a set of fragment extraction patterns defined as finite state transducers (FSTs), where modularity is achieved through a generic input/output device. 
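As an illustration of a fragment extraction pattern, the following is a hypothetical Python sketch (not the system's actual Lisp/FST formalism): a tiny finite state recognizer over POS-tagged tokens that extracts noun-phrase fragments matching DET? ADJ* NOUN. The state names, tag set, and function names are assumptions for illustration only.

```python
# Hypothetical sketch: a fragment extraction pattern as a small finite
# state machine over POS-tagged tokens, recognizing DET? ADJ* NOUN as an
# NP fragment. States and tags are illustrative, not the SMES formalism.

TRANS = {
    ("q0", "DET"): "q1",
    ("q0", "ADJ"): "q1",
    ("q0", "NOUN"): "qf",
    ("q1", "ADJ"): "q1",
    ("q1", "NOUN"): "qf",
}

def extract_np(tagged, start):
    """Try to recognize an NP starting at `start`; return (fragment, end) or None."""
    state, i = "q0", start
    while i < len(tagged):
        nxt = TRANS.get((state, tagged[i][1]))
        if nxt is None:
            break
        state, i = nxt, i + 1
        if state == "qf":
            return ("NP", tagged[start:i]), i
    return None

def extract_all(tagged):
    """Scan the token stream left to right, emitting NP fragments."""
    frags, i = [], 0
    while i < len(tagged):
        hit = extract_np(tagged, i)
        if hit:
            frags.append(hit[0])
            i = hit[1]
        else:
            i += 1
    return frags
```

In the real system such patterns are declared in the specification tool and compiled; the sketch only shows the shape of a pattern that maps a token subsequence to a typed fragment.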
FSTs are compiled to Lisp functions using an extended version of the compiler defined in (Krieger, 1987).</Paragraph> <Paragraph position="7"> A bidirectional lexically driven shallow parser for the combination of extracted fragments. Shallow parsing is basically directed through fragment combination patterns (FCPs) of the form (FSTleft, anchor, FSTright), where anchor is a lexical entry (e.g., a verb like &quot;to meet&quot;) or the name of a class of lexical entries (e.g., &quot;transitive-verb&quot;). FCPs are attached to lexical entries (e.g., verbs), and are selected right after a corresponding lexical entry has been identified. They are applied to the left and right streams of tokens and recognized fragments.</Paragraph> <Paragraph position="8"> The fragment combiner is used for recognizing and extracting clause level expressions, as well as for the instantiation of templates.</Paragraph> <Paragraph position="9"> An interface to TDL, a type description language for constraint-based grammars (Krieger and Schäfer, 1994). TDL is used in SMES for performing type-driven lexical retrieval, e.g., for concept-driven filtering, and for the evaluation of syntactic agreement tests during fragment processing and combination.</Paragraph> <Paragraph position="10"> The knowledge base is the collection of different knowledge sources, viz. lexicon, subgrammars, clause-level expressions, and template patterns. 
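The fragment combination patterns described above can be made concrete with a minimal sketch (hypothetical Python, not the system's compiled-Lisp implementation; `match_np`, the FCP table, and the anchor class name are assumptions). An FCP is selected by its anchor and its left/right automata consume the adjacent fragment streams:

```python
# Simplified fragment combination: an FCP (fst_left, anchor, fst_right)
# is selected when its anchor class matches an identified lexical entry,
# and its left/right automata consume adjacent fragments.

def match_np(fragments):
    """Toy 'FST': consume one NP fragment if present, else fail with None."""
    if fragments and fragments[0][0] == "NP":
        return [fragments[0]], fragments[1:]
    return None

# FCPs keyed by anchor class (hypothetical inventory).
FCPS = {"transitive-verb": (match_np, match_np)}

def combine(left_frags, anchor_class, right_frags):
    """Apply the FCP attached to anchor_class to both fragment streams."""
    fcp = FCPS.get(anchor_class)
    if fcp is None:
        return None
    fst_left, fst_right = fcp
    # The left stream is scanned outward from the anchor, hence reversed.
    l = fst_left(list(reversed(left_frags)))
    r = fst_right(right_frags)
    if l and r:
        return {"subj": l[0], "anchor": anchor_class, "obj": r[0]}
    return None
```

The bidirectionality shows up in the two scans from the anchor outward; on success the consumed fragments can be bound to template slots, which is what the fragment combiner's template instantiation amounts to in this toy form.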
Currently it includes 120,000 lexical root entries, subgrammars for simple and complex date and time expressions, person names, company names, and currency expressions, as well as shallow grammars for general nominal phrases, prepositional phrases, and general verb-modifier expressions.</Paragraph> <Paragraph position="11"> In addition to the above-mentioned components, there is also a generic graphical editor for text items and an HTML interface to the Netscape browser which marks the relevant text parts with typed parentheses that also serve as links to the internal representation of the extracted information.</Paragraph> <Paragraph position="12"> There are two important properties of the system for supporting portability: * Each component outputs the resulting structures uniformly as feature value structures, together with its type and the corresponding start and end positions of the spanned input expressions. We call these output structures text items.</Paragraph> <Paragraph position="13"> * All (un-filtered) resulting structures of each component are cached, so that a component can take into account the results of all previous components. This allows for the definition of cascaded as well as interleaved flow of control. The former means that it is possible to apply a cascade of finite state expressions (comparable to that proposed in (Appelt et al., 1993)), and the latter supports the definition of finite state expressions which incrementally perform a mix of keyword spotting and fragment processing.</Paragraph> <Paragraph position="14"> The system has already been successfully applied to classifying event announcements made via email, scheduling of meetings also sent via email, and extraction of company information from on-line newswires (see section 7 for more details). In the next section, we describe some of the components' properties in more detail.</Paragraph> </Section> </Paper>