<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1031"> <Title>An Information Extraction Core System for Real World German Text Processing</Title> <Section position="4" start_page="210" end_page="210" type="metho"> <SectionTitle> 3 Word level processing </SectionTitle>
<Paragraph position="0"> Text scanning Each file is first preprocessed by the text scanner. Applying regular expressions (the text scanner is implemented in lex, the well-known Unix tool), the text scanner identifies some text structure (e.g., paragraphs, indentations), word, number, date, and time tokens (e.g., &quot;1.3.96&quot;, &quot;12:00 h&quot;), and expands abbreviations. The output of the text scanner is a stream of tokens, where each word is simply represented as a string of alphabetic characters (including delimiters, e.g., &quot;Daimler-Benz&quot;). Number, date, and time expressions are normalized and represented as attribute-value structures. For example, the character stream &quot;1.3.96&quot; is represented as (:date ((:day 1)(:mon 3)(:year 96))), and &quot;13:15 h&quot; as (:time ((:hour 13)(:min 15))).</Paragraph>
<Paragraph position="1"> Morphological processing follows text scanning and performs inflectional analysis and processing of compounds. The capability of processing compounds efficiently is crucial, since compounding is a very productive process of the German language.</Paragraph>
<Paragraph position="2"> The morphological component, called MONA, is a descendant of MORPHIX, a fast classification-based morphology component for German (Finkler and Neumann, 1988). MONA improves on MORPHIX in that the classification-based approach has been combined with the well-known two-level approach originally developed by (Koskenniemi, 1983). The extensions concern
* the use of tries (see (Aho et al., 1983)) as the sole storage device for all sorts of lexical information in MONA (e.g., for lexical entries, prefixes, inflectional endings), and
* the analysis of compound expressions, which is realized by means of a recursive trie traversal.</Paragraph>
<Paragraph position="3"> During traversal, two-level rules are applied for recognizing linguistically well-formed decompositions of the word form in question.</Paragraph>
<Paragraph position="4"> The output of MONA is the word form together with all its readings. A reading is a triple of the form (stem, inflection, pos), where stem is a string or a list of strings (in the case of compounds), inflection is the inflectional information, and pos is the part of speech.</Paragraph>
<Paragraph position="5"> Currently, MONA is used for German and Italian. The German version has a very broad coverage (a lexicon of more than 120,000 stem entries) and an excellent speed (5000 words/sec without compound handling, 2800 words/sec with compound processing, where for each compound all lexically possible decompositions are computed).2</Paragraph>
<Paragraph position="6"> 2 Measurement has been performed on a Sun 20 using an on-line lexicon of 120,000 entries.</Paragraph>
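<Paragraph> To make the recursive trie traversal mentioned above concrete, the following is a minimal sketch in Lisp, the implementation language used elsewhere in the system (an illustration only, not MONA's actual code; the nested-alist trie representation is an assumption, and the application of two-level rules at stem boundaries is omitted):
(defun decompose (word pos trie node stems)
  ;; Collect all decompositions of WORD (from index POS) into known stems.
  ;; A trie node is (ENTRY . CHILDREN): ENTRY is a lexical reading if a
  ;; stem ends at this node (NIL otherwise); CHILDREN maps char -> node.
  (let ((results '())
        (entry (car node)))
    (when entry
      (if (= pos (length word))
          ;; a stem ends exactly at the end of the word: complete analysis
          (push (reverse (cons entry stems)) results)
          ;; a stem boundary inside the word: restart at the trie root
          (setq results
                (append (decompose word pos trie trie (cons entry stems))
                        results))))
    ;; in any case, try to extend the current stem by one character
    (when (< pos (length word))
      (let ((child (cdr (assoc (char word pos) (cdr node)))))
        (when child
          (setq results
                (append (decompose word (1+ pos) trie child stems)
                        results)))))
    results))
A call like (decompose &quot;autobahn&quot; 0 trie trie '()) would then return every segmentation of the word into lexicon stems.</Paragraph>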
<Paragraph position="7"> Part-of-speech disambiguation Morphologically ambiguous readings are disambiguated with respect to part of speech using case-sensitive rules3 and filtering rules which have been determined using Brill's unsupervised tagger (Brill, 1995). The filtering rules are also used for tagging unknown words. They are determined on the basis of unannotated corpora: starting from untagged corpora, MONA is used for initial tagging, where unknown words are ambiguously tagged as noun, verb, and adjective. Then, using contextual information from unambiguously analysed word forms, filter rules are determined which are of the form &quot;change tag of word form from noun or verb to noun if the previous word is a determiner&quot;.</Paragraph>
<Paragraph position="8"> 3 Generally, only nouns (and proper names) are written in standard German with a capitalized initial letter (e.g., &quot;der Wagen&quot; the car vs. &quot;wir wagen&quot; we venture). Since typing errors are relatively rare in press releases (and similar documents), the application of case-sensitive rules is a reliable and straightforward tagging means for German.</Paragraph>
<Paragraph position="9"> First experiments using a training set of 100,000 words and a set of about 280 learned filter rules yield a tagging accuracy (including tagging of unknown words) of 91.4%.4 Note that the unsupervised tagger required no hand-tagged corpora and handled unknown words. We expect to increase the accuracy by improving the unsupervised tagger through the use of more linguistic information determined by MONA, especially for the case of unknown words.</Paragraph>
<Paragraph position="10"> 4 Brill reports a 96% accuracy using a training set of 350,000 words and 1729 rules. However, he does not handle unknown words. In (Aone and Hausman, 1996), an extended version of Brill's tagger which includes unknown words is used for tagging Spanish texts. They report an accuracy of 92.1%.</Paragraph>
</Section> <Section position="5" start_page="210" end_page="213" type="metho"> <SectionTitle> 4 Fragment processing </SectionTitle>
<Paragraph position="0"> Word group recognition and extraction is performed through fragment extraction patterns, which are expressed as finite state transducers (FSTs) and compiled to Lisp functions using a compiler based on (Krieger, 1987). An FST consists of a unique name, the recognition part, the output description, and a set of compiler parameters.</Paragraph>
<Paragraph position="1"> The recognition part An FST operates on a stream of tokens. The recognition part of an FST is used for describing regular patterns over such token streams. For supporting modularity, the different possible kinds of tokens are handled via basic edges, where a basic edge can be viewed as a predicate for a specific class of tokens. More precisely, a basic edge is a tuple of the form (name, test, variable), where name is the name of the edge, test is a predicate, and variable holds the current token Tc if test applied to Tc holds. For example, the basic edge (:mona-cat &quot;partikel&quot; pre) tests whether the token Tc produced by MONA is a particle and, if so, binds the token to the variable pre (more precisely, each variable of a basic edge denotes a stack, so that the current token is actually pushed onto the stack).</Paragraph>
<Paragraph position="2"> We assume that for each component of the system for which fragment extraction patterns are to be defined, a set of basic edges exists. Furthermore, we assume that such a set of basic edges remains fixed at some point in the development of the system and thus can be re-used as pre-specified basic building blocks by a grammar writer.</Paragraph>
<Paragraph position="3"> Using basic edges, the recognition part of an FST is then defined as a regular expression in a functional notation. For example, the recognition part for simple nominal phrases might be defined as follows:</Paragraph>
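<Paragraph position="4"> (a sketch: the variables det, adj, and noun match the output description given below, while the concatenation operator :conc and the category strings &quot;det&quot;, &quot;adj&quot;, and &quot;n&quot; are assumptions)
(:conc (:star<1 (:mona-cat &quot;det&quot; det))
       (:star (:mona-cat &quot;adj&quot; adj))
       (:mona-cat &quot;n&quot; noun))</Paragraph>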
<Paragraph position="5"> Thus defined, a nominal phrase is the concatenation of one optional determiner (expressed by the loop operator :star<n, where n starts at 0 and ends at 1), followed by zero or more adjectives, followed by a noun.</Paragraph>
<Paragraph position="6"> Output description part The output structure of an FST is constructed by collecting together the variables of the recognition part's basic edges, followed by some specific construction handlers. In order to support re-usability of an FST in other applications, it is important to separate the construction handlers from the FST definition. Therefore, the output description part is realized through a function called BUILD-ITEM, which receives as input the edge variables and a symbol denoting the class of the FST. For example, if :np is used as a type name for nominal phrases, then the output description of the above NP recognition part is (build-item :type :np :out (list det adj noun)).</Paragraph>
<Paragraph position="7"> The function BUILD-ITEM then discriminates according to the specified type and constructs the desired output according to some pre-defined requests (note that in the above case the variables DET and ADJ might have received no token; in that case their default value NIL is used as an indication of this fact). Using this mechanism it is possible to define or re-define the output structure without changing the whole FST.</Paragraph>
<Paragraph position="8"> Special edges There exist some special basic edges, namely (:var var), (:current-pos pos), and (:seek name var). The edge (:var var) is used for simply skipping or consuming a token without any checks. The edge :current-pos is used for storing the position of the current token in the variable pos, and the edge :seek is used for calling the FST named name, where var is used as a storage for the output of name. This is similar to the :seek edge known from Augmented Transition Networks, with the notable distinction that in our system recursive calls are disallowed. Thus :seek can also be seen as a macro-expanding operator. The :seek mechanism is very useful in defining modular grammars, since it allows for a hierarchical definition of finite state grammars, from general to specific constructions (or vice versa). The following example demonstrates the use of these special edges:</Paragraph>
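<Paragraph position="9"> (a sketch for illustration; the stem-testing edge :mona-stem and the sub-FST name time are assumptions, while :current-pos, :seek, and the variable names follow the output description below)
(:conc (:current-pos start)
       (:star<1 (:mona-stem &quot;spaet&quot; time-rel))
       (:mona-cat &quot;praep&quot; timeprep)
       (:seek time time)
       (:current-pos end))</Paragraph>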
<Paragraph position="10"> This FST recognizes expressions like &quot;spätestens um 14:00 h&quot; (by two o'clock at the latest) with the output description ((:out (:time-rel . &quot;spaet&quot;) (:timeprep . &quot;um&quot;) (:minute . 0) (:hour . 14)) (:end . 4) (:start . 0) (:type . time-expr)).</Paragraph>
<Paragraph position="11"> Interface to TDL The interface to TDL, a typed feature-based language and inference system, is also realized through basic edges. TDL allows the user to define hierarchically ordered types consisting of type constraints and feature constraints, and was originally developed for supporting high-level competence grammar development.</Paragraph>
<Paragraph position="12"> In SMES we are using TDL for two purposes: 1. defining domain-specific type lattices, and 2. expressing syntactic agreement constraints. The first kind of knowledge is used for performing concept-based lexical retrieval (e.g., for extracting word forms which are compatible with a given supertype, or for filtering out lexical readings which are incompatible with a given type); the second kind is used for directing fragment processing and combination, e.g., for filtering out certain ungrammatical phrases or for extracting phrases of a certain syntactic type.</Paragraph>
<Paragraph position="13"> The integration of TDL and finite state expressions is easily achieved through the definition of basic edges. For example, the edge (:mona-cat-type (:and &quot;n&quot; &quot;device&quot;) var) will accept a word form which has been analyzed as a noun and whose lexical entry type identifier is subsumed by &quot;device&quot;. As an example of defining an agreement test, consider the basic edge (:mona-cat-unify &quot;det&quot; &quot;[(num %1)(case %2 = gen-val)(gender %3)]&quot; agr det), which checks whether the current token is a determiner and whether its inflection information (computed by MONA) unifies with the specified constraints (here, it is checked whether the determiner has a genitive reading, where structure sharing is expressed through variables like %1). If so, agr is bound to the result of the unification and the token is bound to det. If in the same FST a similar edge for noun tokens follows which also makes reference to the variable agr, the new value for agr is checked against its old value. In this way, agreement information is propagated through the whole FST.</Paragraph>
<Paragraph position="14"> An important advantage of using TDL in this way is that it supports the specification of very compact and modular finite state expressions. One might argue that using TDL in this way could have dramatic effects on the efficiency of the whole system if the full power of TDL were used, and in some sense this is true. In our current system, however, we only allow type subsumption, which is performed by TDL very efficiently, and unification constraints are used very carefully and restrictively. Furthermore, the TDL interface opens up the possibility of integrating deeper processing components very straightforwardly.</Paragraph>
<Paragraph position="15"> Control parameters In order to obtain flexible control mechanisms for the matching phase, it is possible to specify whether an exact match is requested or whether an FST should already succeed when the recognition part matches a prefix (or suffix, respectively) of the input string. The prefix matching mechanism is used in conjunction with the Kleene :star and the identity edge :var to allow for searching the whole input stream, extracting all matching expressions of an FST (e.g., all NPs or time expressions). Additionally, a boolean parameter can be used to specify whether longest or shortest matches should be preferred (the default is longest match; see also (Appelt et al., 1993), where longest subsuming phrases are also preferred). For example, an FST of this kind extracts all genitive NPs found in the input stream and collects them in a list.</Paragraph>
<Paragraph position="16"> The combination of extracted fragments is performed by a lexically driven bidirectional shallow parser, which operates on fragment combination patterns (FCPs) attached to lexical entries (mainly verbs). We call these lexical entries anchors.</Paragraph>
<Paragraph position="17"> The input stream for the shallow parser consists of a doubly linked list of all extracted fragments found in some input text, all punctuation tokens and text tokens (like newline or paragraph), and all found anchors (i.e., all other tokens of the input text are ignored). The shallow parser then applies for each anchor its associated FCP. An anchor can be viewed as splitting the input stream into a left and a right input part. Application of an FCP then starts directly from the input position of the anchor and searches the left and right input parts for candidate fragments. Searching stops either if the beginning or the end of the text has been reached or if some punctuation token, text token, or other anchor defined as a stop marker has been recognized.</Paragraph>
<Paragraph position="18"> General form of fragment combination patterns An FCP consists of a unique name, a recognition part applied to the left input part and one for the right input part, an output description part, and a set of constraints on the type and number of collected fragments. As a prototypical case, consider the following FCP defined for intransitive verbs like to come or to begin:</Paragraph>
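<Paragraph position="19"> (a sketch: the keywords :left, :right, :nec, :opt, and :ignore-token's argument are assumptions, while the edge names, the bounds (1 1) and (0 2), and the variable lcompl follow the description below)
(intrans-verb
 :left  (:star (:or (:add-nec (:or :np :name-np))
                    (:add-opt :tmp)
                    (:ignore-token :comma)))
 :right (:star (:or (:add-nec (:or :np :name-np))
                    (:add-opt :tmp)
                    (:ignore-fragment :pp)))
 :nec   ((:np nom-val (1 1)))
 :opt   ((:tmp (0 2)))
 :out   (build-item :type :intrans-clause
                    :out (list lcompl anchor rcompl)))</Paragraph>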
<Paragraph position="20"> The FCP is applied to both sides of the anchor and introduces two sets of constraints, which are used to define restrictions on the type and number of necessary and optional fragments. For example, the first constraint says that exactly one :np fragment (expressed by the lower and upper bound in (1 1)) in nominative case must be collected, whereas the second constraint says that at most two optional fragments of type :tmp can be collected. The two constraints are maintained by the basic edges :add-nec and :add-opt. :add-nec performs as follows: if the current token is a fragment of type :np or :name-np, then inspect the set named nec and select the constraint typed :np. If the current token agrees in case (which is tested by type subsumption), then push it to lcompl and reduce the upper bound by 1. Since the upper bound is then 0, no more fragments will be considered for the set nec.5 In a similar manner, :add-opt is processed.</Paragraph>
<Paragraph position="21"> 5 In some sense this mechanism behaves like the subcategorization principle employed in constraint-based lexical grammars.</Paragraph>
<Paragraph position="22"> The edges :ignore-token and :ignore-fragment are used to explicitly specify what sorts of tokens will not be considered by :add-nec or :add-opt. In other words, each token which is not mentioned in the FCP will stop the application of the FCP on the current input part (left or right).</Paragraph>
<Paragraph position="23"> Complex verb constructions In our current system, FCPs are attached to main verb entries. Expressions which contain modal or auxiliary verbs or separated verb prefixes are handled by lexical rules which are applied after fragment processing and before shallow processing. Although this mechanism has turned out to be practical enough for our current applications, we have also defined complex verb group fragments (VGFs). A VGF is applied after fragment processing has taken place. It collects all verb forms used in a sentence and returns the underlying dependency-based structure. Such a VGF is then used as a complex anchor for the selection of appropriate fragment combination patterns as described above. The advantage of verb group fragments is that they help to handle more complex constructions (e.g., time or speech act) in a more systematic (but still shallow) way.</Paragraph>
<Paragraph position="24"> Template generation An FCP expresses restrictions on the set of candidate fragments to be collected by the anchor. If successful, the set of found fragments together with the anchor builds up an instantiated template or frame. In general, a template is a record-like structure consisting of features and their values, where each collected fragment and the anchor builds up a feature/value pair.</Paragraph>
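<Paragraph position="25"> (a hypothetical instance for illustration; the sentence and the slot names are assumptions, modelled on the output structures shown in section 6) For the sentence &quot;Die Sitzung beginnt um 14:00 h&quot; (the meeting begins at 2 pm), the intransitive-verb FCP above would build up a template roughly like
((:verb . &quot;beginnen&quot;)
 (:subj ((:head . &quot;sitzung&quot;) (:quantifier . &quot;d-det&quot;)))
 (:tmp ((:timeprep . &quot;um&quot;) (:hour . 14) (:minute . 0)))
 (:type . :intrans-clause))</Paragraph>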
<Paragraph position="26"> An FCP also defines which sorts of fragments are necessary or optional for building up the whole template. FCPs are used for defining linguistically oriented general head-modifier constructions (based on dependency theory) as well as application-specific database entries. The &quot;shallowness&quot; of the template construction/instantiation process depends on how weakly the FSTs of an FCP are defined.</Paragraph>
<Paragraph position="27"> A major drawback of our current approach is that necessary and optional constraints are defined together in one FCP. For example, if an FCP is used for defining generic clause expressions, where complements are defined through necessary constraints and adjuncts through optional constraints, then it has been shown that the constraints on the adjuncts can change for different applications. Thus we actually lack some modularity concerning this issue. A better solution would be to attach optional constraints directly to lexical entries and to &quot;splice&quot; them into an FCP after its selection.</Paragraph>
</Section> <Section position="6" start_page="213" end_page="213" type="metho"> <SectionTitle> 6 Coverage of knowledge sources </SectionTitle>
<Paragraph position="0"> The lexicon in use contains more than 120,000 stem entries (concerning morpho-syntactic information).</Paragraph>
<Paragraph position="1"> The time and date subgrammar covers a wide range of expressions, including nominal, prepositional, and coordinated expressions, as well as combined date-time expressions. For example, &quot;vom 19. (8.00 h) bis einschl. 21. Oktober (18.00 h)&quot; yields: (:pp (from :np (day . 19) (hour . 8) (minute . 0)) (to :np (day . 21) (month . 10) (hour . 18) (minute . 0)))</Paragraph>
<Paragraph position="2"> The NP/PP subgrammars cover, e.g., coordinated NPs, different forms of adjective constructions, genitive expressions, and pronouns. The output structure reflects the underlying head-modifier relations; e.g., &quot;Die neuartige und vielfältige Gesellschaft&quot; yields: (((:sem (:head &quot;gesellschaft&quot;) (:mods &quot;neuartig&quot; &quot;vielfaeltig&quot;) (:quantifier &quot;d-det&quot;)) (:agr nom-acc-val) (:end . 6) (:start . 1) (:type . :np)))</Paragraph>
<Paragraph position="3"> Thirty generic syntactic verb subcategorization frames are defined by fragment combination patterns (e.g., for the transitive verb frame). Currently, these verb frames are handled by the shallow parser with no ordering restriction, which is reasonable because German is a language with relatively free word order. However, in future work we will investigate the integration of shallow linear precedence constraints.</Paragraph>
<Paragraph position="4"> The specification of the current data has been performed on a tagged corpus of about 250 texts (ranging in size from a third of a page to one page) about event announcements, appointment scheduling, and business news, following a bottom-up grammar development approach.</Paragraph>
</Section> <Section position="7" start_page="213" end_page="215" type="metho"> <SectionTitle> 7 Current applications </SectionTitle>
<Paragraph position="0"> On top of SMES, three application systems have been implemented:</Paragraph>
<Paragraph position="1"> 1. appointment scheduling via email: extraction of co-operative act, duration, range, appointment, sender, receiver, topic; 2. classification of event announcements sent via email: extraction of speaker, title, time, and location; 3. extraction of company information from newspaper articles: company name, date, turnover, revenue, quality, difference.</Paragraph>
<Paragraph position="2"> For these applications the main architecture (as described above), the scanner, the morphology, the set of basic edges, and the subgrammars for time/date and phrasal expressions could be used basically unchanged.</Paragraph>
<Paragraph position="3"> In (1), SMES is embedded in the COSMA system, a German language server for existing appointment scheduling agent systems (see (Busemann et al., 1997), this volume, for more information). In case (2), additional FSTs for the text structure have been added, since the text structure is an important source for the location of relevant information. However, since the form of event announcements is usually not standardized, shallow NLP mechanisms are necessary. Hence, the main strategy realized is a mix of text structure recognition and restricted shallow analysis. For application (3), new subgrammars for company names and currency expressions had to be defined, as well as a task-specific reference resolution method.</Paragraph>
<Paragraph position="4"> Processing is very robust and fast (between 1 and 10 CPU seconds on a Sun UltraSparc, depending on the size of the text, which ranges from very short texts of a few sentences up to short texts of one page). In all three applications we obtained high coverage and good results. Because of the lack of comparable existing IE systems for German texts in similar domains, and the lack of evaluation standards for German (comparable to those of MUC), we cannot make comparative claims for these results.</Paragraph>
<Paragraph position="5"> However, we have now started the implementation of a new application together with a commercial partner, where a more systematic evaluation of the system is carried out. Here, SMES is applied to a quite different domain, namely news items concerning the German IFOR mission in former Yugoslavia. Our task is to identify those messages which are about violations of the peace treaty and to extract the information about location, aggressor, defender, and victims.</Paragraph>
<Paragraph position="6"> The corpus consists of a set of monthly reports (Jan. 1996 to Aug. 1996), each consisting of about 25 messages, of which 2 to 8 messages are about fighting actions. These messages have been hand-tagged with respect to the relevant information. Although we are still in the development phase, we will briefly describe our experience of adapting SMES to this new domain. Starting from the assumption that the core machinery can be used unchanged, we first measured the coverage of the existing linguistic knowledge sources. Concerning the above-mentioned corpus, the lexicon covers about 90%.
However, of the 10% unrecognized words, about 70% are proper names (which we will handle without a lexicon) and 1.5% are spelling errors, so that the lexicon actually covers more than 95% of this unseen text corpus.</Paragraph>
<Paragraph position="7"> The same &quot;blind&quot; test was also carried out for the date, time, and location subgrammars, i.e., they have been run on the new corpus without any adaptation to the specific domain knowledge. For the date/time expressions we obtained a recall of 77% and a precision of 88%, and for the location expressions we obtained 66% and 87%, respectively. In the latter case, most of the unrecognized expressions concern expressions like &quot;nach Taszar/Ungarn&quot;, &quot;im serbischen bzw. kroatischen Teil Bosniens&quot;, or &quot;in der Moslemisch-kroatischen Föderation&quot;. For the general NP and PP subgrammars we obtained a recall of 55% and a precision of 60% (concerning correct head-modifier structure). The small recall is due to lexical gaps (including proper names) and unforeseen complex expressions like &quot;die Mehrzahl der auf 140.000 geschätzten moslemischen Flüchtlinge&quot;. Note, however, that these grammars were written on the basis of different corpora.</Paragraph>
<Paragraph position="8"> In order to measure the coverage of the fragment combination patterns (FCPs), the relevant main verbs of the tagged corpora have been associated with the corresponding FCPs (e.g., the FCP for transitive verbs), without changing the original definitions of the FCPs. The only major change concerned the extension of the output description function BUILD-ITEM for building up the new template structure. After a first trial run we obtained an unsatisfactory recognition rate of about 25%. One major problem we identified was the frequent use of passive constructions, which the shallow parser was not able to process. Consequently, as a first actual extension of SMES to the new domain, we extended the shallow parser to cope with passive constructions. Using this extension we obtained a recognition rate of about 40% after a new trial run.</Paragraph>
<Paragraph position="9"> After the analysis of the (partially) unrecognized messages (including the misclassified ones), we identified the following major bottlenecks of our current system. First, many of the partially recognized templates are part of coordinations (including enumerations), in which case several (local) templates share the same slot, but this slot is mentioned only once. Resolving this kind of &quot;slot sharing&quot; requires the processing of elliptic expressions of different kinds as well as domain-specific inference rules, which we had not foreseen as part of the core system. Second, the wrong recognition of messages is often due to the lack of semantic constraints, which would be applied during shallow parsing in a similar way to the subcategorization constraints.</Paragraph>
<Paragraph position="10"> Although these current results should and can be improved, we are convinced that the idea of developing a core IE engine is a worthwhile venture.</Paragraph>
</Section> <Section position="8" start_page="215" end_page="215" type="metho"> <SectionTitle> 8 Related work </SectionTitle>
<Paragraph position="0"> In Germany, IE based on innovative language technology is still a novelty. The only groups we are aware of which also consider NLP-based IE are (Hahn, 1992; Bayer et al., 1994). None of them makes use of such sophisticated components as we do in SMES.
Our work is most influenced by the work of (Hobbs, 1992; Appelt et al., 1993; Grishman, 1995), as well as by the work described in (Anderson et al., 1992; Dowding et al., 1993).</Paragraph> </Section> </Paper>