<?xml version="1.0" standalone="yes"?>
<Paper uid="M93-1018">
<Title>SRA: Description of the SOLOMON System as Used for MUC-5</Title>
<Section position="2" start_page="0" end_page="0" type="metho">
<SectionTitle> BACKGROUND </SectionTitle>
<Paragraph position="0"> SRA used a language-independent, domain-independent, multipurpose text understanding system as the core of its MUC-5 system for extraction from English and Japanese joint venture texts. SRA's NLP core system, SOLOMON, has been under development since 1986. It has been used for a variety of domains and was designed from the start to be language-independent, domain-independent, and application-independent. More recently, SOLOMON has been extended to be multilingual, beginning with Spanish in 1990 and Japanese in 1991. The Spanish-Japanese text understanding system that uses SOLOMON was developed for a domain very different from the MUC-5 joint venture domain (cf. Aone et al. [2]).</Paragraph>
<Paragraph position="1"> SOLOMON's principal applications have been in data extraction, but it is also used in a prototype machine translation system (cf. Aone and McKee [5]). The domain areas in which SOLOMON applications have been developed are: financial, terrorism, medical, and the MUC-5 joint-venture domain. SRA has significantly enhanced its capability to add new domains and languages by developing new strategies for data acquisition using both statistical techniques and a variety of user-friendly tools.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="212" type="metho">
<SectionTitle> MUC-5 SYSTEM ARCHITECTURE </SectionTitle>
<Paragraph position="0"> SOLOMON employs a modular, data-driven architecture to achieve its language- and domain-independence.</Paragraph>
<Paragraph position="1"> The MUC-5 system, which uses SOLOMON as its core engine, consists of seven processing modules and corresponding data modules, as shown in Figure 1; these are described in the following sections.</Paragraph>
<Section position="1" start_page="0" end_page="207" type="sub_section">
<SectionTitle> Message Zoner </SectionTitle>
<Paragraph position="0"> The Message Zoner loads the SGML-annotated text file into the data extraction system. Input files are assumed to have been preprocessed so that they contain only &quot;rigorous markup&quot; (cf. Goldfarb [8]) SGML tags and text; however, we do not require sentences or paragraphs to be tagged. Japanese text is assumed to be encoded in EUC, but tags must be ASCII.</Paragraph>
<Paragraph position="1"> All input, including tags, is tokenized using a simple, language-independent, regular expression recognizer. The (multi-word) tokens are parsed into sentences, paragraphs, headers, and documents using a simple operator-precedence grammar (cf. Aho, Sethi and Ullman [1]) operating on punctuation and tags. The tokenizer and parser are written entirely in lex.</Paragraph>
<Paragraph position="2"> Figure 1: MUC-5 System Architecture
Sentence and paragraph boundaries are inferred using a conservative algorithm and marked as inferred. Inference is not performed if sentences and paragraphs are rigorously marked.
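To make this concrete, a minimal Python sketch of such a conservative boundary-inference step is given below. It is only an illustration under simple assumptions (a boundary is proposed only when strong punctuation is followed by an upper-case-initial token, with an abbreviation stoplist); the names and heuristics are hypothetical, not SRA's actual algorithm.

# Hypothetical sketch of conservative sentence-boundary inference: a boundary
# is proposed only when strong punctuation precedes an upper-case-initial
# token, and known abbreviations block the split.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Co.", "Inc.", "Ltd.", "etc."}

def infer_sentence_boundaries(tokens):
    """Return indices of tokens after which a boundary is *inferred*."""
    boundaries = []
    for i in range(len(tokens) - 1):
        tok, nxt = tokens[i], tokens[i + 1]
        if tok in ABBREVIATIONS:
            continue  # never split after a known abbreviation
        if tok.endswith((".", "!", "?")) and nxt[:1].isupper():
            boundaries.append(i)  # marked as inferred, not as given
    return boundaries

tokens = ["The", "plan", "was", "announced", "Friday", ".",
          "A", "new", "unit", "was", "set", "up", "."]
print(infer_sentence_boundaries(tokens))  # [5]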
The output is piped to a post-processor, which does a fast lookup of each word in a btree gazetteer and includes entry information in the tokens of place names.</Paragraph>
<Paragraph position="3"> Preprocessing
Preprocessing consists of two processors, the morphological analyzer and the pattern matcher, and associated data in the form of morphological data, lexicons, and patterns for each language. Its input is a tokenized message, and its output is a series of lexical entries with syntactic and semantic attributes. Declarative morphological data for the inflection-rich Japanese and Spanish are compiled into finite-state machines. The English domain lexicon was derived automatically from development texts, using a statistical technique (cf. McKee and Maloney [10]). This derived lexicon also contains automatically acquired domain-specific subcategorization frames and predicate-argument mapping rules called situation types (cf. Aone and McKee [3]), as shown in Figure 2.</Paragraph>
<Paragraph position="4"> Pattern recognition handles a wide range of phenomena, including multi-words, numbers, acronyms, money, dates, person names, locations, and organizations. We extended the pattern matcher to handle multi-level pattern recognition. The pattern data are divided into ordered groups called priority groups, and the patterns in each group are fired sequentially, avoiding recursive application as much as possible. This extension sped up Preprocessing significantly.</Paragraph>
</Section>
<Section position="2" start_page="207" end_page="209" type="sub_section">
<SectionTitle> Syntactic Analysis </SectionTitle>
<Paragraph position="0"> The processor for Syntactic Analysis is a parser based on Tomita's algorithm (cf. Tomita [11]), with modifications for disambiguation during parsing. Syntactic Analysis data consist of X-bar based phrase structure grammars and preparse patterns for each of the three languages: English, Japanese, and Spanish. Syntactic Analysis outputs F-structures (grammatical relations) along the lines of Lexical-Functional Grammar (cf. Bresnan [7]), as shown in Figure 3.</Paragraph>
<Paragraph position="1"> The Semantic Interpretation module is interleaved for disambiguation of prepositional phrase attachment, conjunctions, and so on, by calling semantic functions, which are shared by all three languages, from inside the grammar.</Paragraph>
<Paragraph position="2"> Preparsing takes the burden off of main parsing and increases accuracy by recognizing structures such as sentential complements, appositives, certain PPs, etc. through pattern matching, and sending these to the parser as chunks. These preparse chunks are parsed prior to main parsing using the same grammars, and their output consists of F-structures as well.</Paragraph>
<Paragraph position="3"> In order to test the progress of grammar development and pinpoint trouble spots, automatic evaluation of grammars was used. SRA adapted the community-wide program Parseval (cf. Black et al. [6]) for use with Japanese in addition to English. Testing on Japanese was limited, since there are not many bracketed Japanese texts to use as answer keys.</Paragraph>
<Paragraph position="4"> Semantic Interpretation
Semantic Interpretation uses a language-independent processing module, and its data are predicate-argument mapping rules for each verb, plus both core and domain knowledge bases.
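As a rough illustration of what such a mapping rule might look like, the sketch below maps a toy F-structure to a predicate-argument structure. The rule format, role names, and fallback behavior are hypothetical, not SOLOMON's actual notation.

# Hypothetical sketch of a predicate-argument mapping rule: grammatical
# relations in an F-structure are mapped to semantic roles, yielding a
# predicate-argument semantic structure.
MAPPING_RULES = {
    # verb: {grammatical function: semantic role}
    "establish": {"subject": "agent", "object": "theme"},
}

def interpret(f_structure):
    """Map an F-structure (a dict of grammatical relations) to a
    predicate-argument structure, or None if no rule applies."""
    rule = MAPPING_RULES.get(f_structure["pred"])
    if rule is None:
        return None  # no rule: e.g. hand the fragment to Debris Semantics
    args = {role: f_structure[gf]
            for gf, role in rule.items() if gf in f_structure}
    return {"predicate": f_structure["pred"], "arguments": args}

fs = {"pred": "establish",
      "subject": "BRIDGESTONE SPORTS CO.",
      "object": "a joint venture"}
print(interpret(fs))
# {'predicate': 'establish',
#  'arguments': {'agent': 'BRIDGESTONE SPORTS CO.', 'theme': 'a joint venture'}}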
Semantic Interpretation works off of language-neutral F-structures in order to handle all the languages. It outputs semantic structures, i.e. predicate-argument and modification relations, as shown in Figure 4. The predicate-argument mapping rules (i.e. rules which map F-structures to semantic structures) are acquired automatically (cf. Aone and McKee [3]). Domain knowledge bases, on the other hand, were acquired manually. However, a new rapid knowledge acquisition tool called KATool was used to link a lexical entry to its corresponding semantic concept in the knowledge bases (cf. Figure 5).</Paragraph>
<Paragraph position="5"> If a full parse cannot be created, SOLOMON uses a fragment combination strategy. Debris Parsing and its subsequent process, Debris Semantics, work together to obtain the best interpretation from sentence fragments. They use as data the grammars and knowledge bases, and they output semantic structures just as when a full parse is created. Debris Parsing retrieves the largest and most preferred constituents from the parse stack. It then reparses the rest of the input and creates debris F-structures with the best fragment constituents. Debris Semantics relies on the semantic interpreter to process each fragment, and then fits the fragments together using semantic constraints on unfilled slots.</Paragraph>
</Section>
<Section position="3" start_page="209" end_page="212" type="sub_section">
<SectionTitle> Discourse Analysis </SectionTitle>
<Paragraph position="0"> Discourse Analysis, which was redesigned and implemented this year (cf. Aone and McKee [4]), performs reference resolution. Discourse Analysis uses a data-driven architecture to achieve language-independence, domain-independence, and extensibility. It employs a single language-independent, domain-independent processor and several discourse knowledge bases, some of which are shared among different languages. The output of Discourse Analysis is a set of semantic structures with coreference links added, i.e. File Cards (cf. Heim [9]). Discourse phenomena handled for the joint venture domain include name anaphora (e.g. &quot;BRIDGESTONE SPORTS&quot; referring back to &quot;BRIDGESTONE SPORTS CO.&quot;).</Paragraph>
<Paragraph position="2">
DISCOURSE: Classified $<DISCOURSE-MARKER DISCOURSE-MARKER-181>(&quot;BRIDGESTONE SPORTS&quot;) as DP-NAME
DISCOURSE: Found an exact match,
  ante: $<DISCOURSE-MARKER DISCOURSE-MARKER-83>(&quot;BRIDGESTONE SPORTS CO.&quot;)
  ref: $<DISCOURSE-MARKER DISCOURSE-MARKER-181>(&quot;BRIDGESTONE SPORTS&quot;)
DISCOURSE: Classified $<DISCOURSE-MARKER DISCOURSE-MARKER-206>(&quot;BRIDGESTONE SPORTS&quot;) as DP-NAME
DISCOURSE: Found an exact match,
  ante: $<DISCOURSE-MARKER DISCOURSE-MARKER-181>(&quot;BRIDGESTONE SPORTS&quot;)
  ref: $<DISCOURSE-MARKER DISCOURSE-MARKER-206>(&quot;BRIDGESTONE SPORTS&quot;)
Figure 6: English Discourse Trace Example</Paragraph>
<Paragraph position="4"> The system traces for the English and Japanese walkthrough examples are shown in Figure 6 and Figure 7.</Paragraph>
<Paragraph position="5"> In the English example, the two instances of name anaphora for &quot;Bridgestone Sports Co.&quot; are recognized, while in the Japanese example, all the references to &quot;Tokyo Kaijou Kasai Hoken,&quot; including appositives, are resolved.</Paragraph>
<Paragraph position="6"> Pragmatic Inferencing
Pragmatic Inferencing performs reasoning in order to derive implicit information from the text, using a forward chainer and inference rules.
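To make the mechanism concrete, a toy version of such a forward chainer is sketched below, with the walkthrough's joint-venture inference encoded as a rule function. The fact encoding and rule format are hypothetical illustrations, not SRA's actual rule language.

# Hypothetical sketch of forward chaining over declarative inference rules.
# Facts are tuples; each rule derives new facts from the current fact set,
# and the chainer iterates until a fixed point is reached.
def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_facts = rule(facts) - facts
            if new_facts:
                facts |= new_facts
                changed = True
    return facts

def jv_company_rule(facts):
    # establishment(event, x) & tie-up(event) => joint-venture-company(x)
    derived = set()
    for fact in facts:
        if fact[0] == "establishment":
            _, event, company = fact
            if ("tie-up", event) in facts:
                derived.add(("joint-venture-company", company))
    return derived

facts = {("tie-up", "EVENT-1"),
         ("establishment", "EVENT-1", "THE TAIWAN UNIT")}
print(forward_chain(facts, [jv_company_rule]))
# derives ('joint-venture-company', 'THE TAIWAN UNIT')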
Pragmatic Inferencing outputs semantic structures, with the inferred information added. It infers additional information from &quot;literal&quot; meanings as required for application domains. For instance, in the walkthrough example, in order to infer that &quot;THE TAIWAN UNIT&quot; is a joint venture company from the phrase &quot;THE ESTABLISHMENT OF THE TAIWAN UNIT&quot;, a rule to that effect is used. It is easy for developers to add, change, or remove inferred information due to the declarative nature of the inference rules; for instance, to get an additional tie-up from &quot;Company A and Company B tied with ...&quot;
The Extract module performs template generation, translating the domain-relevant portions of our language-independent semantic structures into database records. We maintain a strong distinction between processing and data even in template generation. Thus, we use the same processing module to output in different languages and to several database schemata, including a flat template-style schema as in MUC-4 and a more object-oriented schema as in MUC-5.</Paragraph>
<Paragraph position="7"> To do the actual template filling, we rely on Extract data made up of kb-object/slot to db-table/field mapping rules and conversion functions for the individual values (e.g. set fills, string fills). For example, the #nationality slot of an #ORGANIZATION object in our knowledge base corresponds to the Nationality field of the Entity object in the MUC-5 template.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="212" end_page="213" type="metho">
<SectionTitle> REUSABILITY OF THE SYSTEM </SectionTitle>
<Paragraph position="0"> SOLOMON is designed for reusability. Each processing module is data-driven and reusable in other languages and other domains, as well as in applications other than data extraction (e.g. machine translation, abstracting, summarization). A large portion of the data is also reusable:
- Some of the discourse knowledge sources
- Inference rules
- Extract (template generation) data
The data acquisition tools and techniques are also reusable in other languages and domains. The statistical techniques used to derive lexical information can be reused for other domains. LEXTool, the lexicon acquisition tool, is multilingual and relies on system data files for category and morphological information. KBTool, the knowledge base acquisition tool, is language-independent, just as the knowledge bases are language-independent. KATool, the knowledge acquisition tool that links lexicon entries with the appropriate knowledge base concepts, is entirely data-driven as well, and is therefore completely reusable. Figure 8 summarizes the reusability of SRA's MUC-5 system.</Paragraph>
</Section>
</Paper>